Feature construction and selection. Part 2


We already know what features are and why they are important in machine learning models. We will try to understand the next technique: feature selection.

What is feature selection?

In the first article we dealt with feature engineering. We now have freshly generated features in our table alongside the initial data, which raises an important question: which ones should we use? There is an infinite number of possible transformations.

“Running” all the features through a model to see which ones work is a bad idea. In fact, algorithms tend to perform worse when too many features are fed into them. How do you solve this problem? With feature selection.

Feature selection is the process of evaluating the importance of individual features with machine learning algorithms and eliminating the unnecessary ones.

There are many algorithms that transform a dataset with too many features into a manageable subset. As with feature engineering, different methods are optimal for different types of data. When choosing an algorithm, we need to consider our goals: what do we want to do with the processed dataset?

Choosing the best features out of the many available ones is not an easy task. A large number of features considerably increases computation time, and there is also the threat of overfitting.

There are several common methods for dealing with the problem, which fall into one of several categories.

1. Filter methods

Filter methods evaluate features based on their intrinsic characteristics. They are faster and less computationally expensive than wrapper methods, which makes them the cheaper choice when dealing with high-dimensional data.

Information Gain (IG)

Information gain measures the reduction in entropy that results from transforming a dataset. It can be used to select features by estimating the information gain of each variable with respect to the target variable.

import pandas as pd
import numpy as np
from sklearn.feature_selection import mutual_info_classif
import matplotlib.pyplot as plt
# Where data is your dataset; X, y are the input and output data
importances = mutual_info_classif(X, y)
feature_importances = pd.Series(importances, index=data.columns[0:len(data.columns) - 1])
feature_importances.plot(kind='barh', color='teal')
plt.show()
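
The plot above only ranks the features. As a follow-up, here is a minimal sketch of how the top-k features could actually be kept, using scikit-learn's SelectKBest with the same mutual_info_classif scorer (X, y and data are assumed to be the same objects as above; k = 5 is an arbitrary choice):

from sklearn.feature_selection import SelectKBest, mutual_info_classif
# Keep the 5 features with the highest estimated information gain (k=5 is arbitrary)
selector = SelectKBest(mutual_info_classif, k=5)
X_top = selector.fit_transform(X, y)
print("Selected columns:", list(data.columns[0:len(data.columns) - 1][selector.get_support()]))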

Chi-square test

The chi-square test is used for categorical features in a dataset. We calculate the chi-square statistic between each feature and the target, and then select the desired number of features with the best scores.

To properly apply the test to the relationship between the features and the target variable, the following conditions must be met: the variables must be categorical and sampled independently, and the expected frequency of each value should be greater than 5.

import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# Convert to categorical data by casting to integers
# X, y are input and output data
X_categorical = X.astype(int)
# Select 3 features with the highest chi-square
chi2_features = SelectKBest(chi2, k = 3)
X_kbest_features = chi2_features.fit_transform(X_categorical, y)
# The "before and after" output
print("Number of features before conversion:", X_categorical.shape[1])
print("Number of features after conversion:", X_kbest_features.shape[1])

Fisher’s test (F-test)

Fisher’s criterion is one of the most widely used methods of supervised feature selection. The algorithm returns the variables ranked by their criterion score in descending order, which can then be used for selection.

import pandas as pd
import numpy as np
from skfeature.function.similarity_based import fisher_score
import matplotlib.pyplot as plt
# Calculate the criterion
# Where X, y are input and output data
ranks = fisher_score.fisher_score(X, y)
# Making a graph of our 'features'
# Where data is your dataset
feature_importances = pd.Series(ranks, index=data.columns[0:len(data.columns) - 1])
feature_importances.plot(kind='barh', color='teal')
plt.show()

Correlation coefficient

Correlation is a measure of the linear relationship between two or more variables. By using it we can predict one variable through another. The logic behind using this method for feature selection: “good” variables are highly correlated with our target.

Variables should be correlated with the target but uncorrelated with each other. In the example below, we will use the Pearson correlation.

import seaborn as sns
import matplotlib.pyplot as plt
# Correlation matrix
# Where data is your dataset
correlation_matrix = data.corr()
# Display the features on a heat map
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True)
plt.show()
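
The heat map only visualises the relationships. A minimal sketch of the actual selection step, assuming the target column is named 'target' and using an arbitrary threshold of 0.5, might look like this:

# Keep the features whose absolute correlation with the target exceeds 0.5
# ('target' and the 0.5 threshold are illustrative assumptions)
target_correlations = correlation_matrix['target'].drop('target').abs()
selected_features = target_correlations[target_correlations > 0.5].index.tolist()
print("Features strongly correlated with the target:", selected_features)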

Mean Absolute Difference (MAD)

This technique computes, for each feature, the mean absolute deviation of its values from their mean. Features with a higher value vary more and are therefore considered more informative.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Calculate MAD
# Where X is the input data
mean_absolute_difference = np.sum(np.abs(X - np.mean(X, axis=0)), axis=0) / X.shape[0]
# Features plot
plt.bar(np.arange(X.shape[1]), mean_absolute_difference, color='teal')
plt.show()
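
To turn the scores into an actual selection, one could keep the k features with the largest MAD, for example (k = 5 is an arbitrary choice, and X is assumed to be a pandas DataFrame):

# Keep the 5 features with the largest MAD (k=5 is arbitrary)
k = 5
top_indices = np.argsort(np.asarray(mean_absolute_difference))[-k:]
X_selected = X.iloc[:, top_indices]
print("Selected feature indices:", list(top_indices))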

2. Wrapper methods

The main idea behind these methods is to search over subsets of features and estimate the quality of each candidate subset by “running” it through a model.

This type of feature selection process is based on the particular machine learning algorithm we use. It typically follows a greedy search, evaluating candidate feature combinations against a particular criterion. Wrapper methods usually provide better prediction accuracy than filter methods.

Forward feature selection

This is an extremely straightforward method: we start with the single variable that performs best with respect to the target, then add the variable that gives the best performance in combination with the first, and so on. The process continues until the target criterion is achieved.

from sklearn.linear_model import LogisticRegression
from mlxtend.feature_selection import SequentialFeatureSelector
# X, y are the input and output data;
# X_train, y_train are the input and output data from the training sample
lr = LogisticRegression(class_weight='balanced', solver='lbfgs',
                        random_state=42, n_jobs=-1, max_iter=500)
ffs = SequentialFeatureSelector(lr, k_features='best', forward=True, n_jobs=-1)
ffs.fit(X, y)
features = list(ffs.k_feature_names_)
features = list(map(int, features))
lr.fit(X_train[features], y_train)
y_pred = lr.predict(X_train[features])

Backward feature selection

This method works in exactly the opposite way to forward feature selection: we start with all available features, build a model, and then remove the variable whose elimination gives the best value of the evaluation measure. The process continues until a given criterion is reached.

from sklearn.linear_model import LogisticRegression
from mlxtend.feature_selection import SequentialFeatureSelector
# X, y are the input and output data;
# X_train, y_train are the input and output data from the training sample
lr = LogisticRegression(class_weight='balanced', solver='lbfgs',
                        random_state=42, n_jobs=-1, max_iter=500)
bfs = SequentialFeatureSelector(lr, k_features='best', forward=False, n_jobs=-1)
bfs.fit(X, y)
features = list(bfs.k_feature_names_)
features = list(map(int, features))
lr.fit(X_train[features], y_train)
y_pred = lr.predict(X_train[features])

Exhaustive feature selection

This is the most thorough feature selection method available: it evaluates every subset of features by brute force. In other words, it runs all possible combinations of features through the algorithm and returns the best-performing subset.

from mlxtend.feature_selection import ExhaustiveFeatureSelector
from sklearn.ensemble import RandomForestClassifier
# Create the ExhaustiveFeatureSelector object
efs = ExhaustiveFeatureSelector(RandomForestClassifier(),
        min_features=4,
        max_features=8,
        scoring='roc_auc',
        cv=2)
# Where X, y are the input and output data
efs = efs.fit(X, y)
# Output the selected features
selected_features = X_train.columns[list(efs.best_idx_)]
print(selected_features)

Recursive feature elimination (RFE)

First, the model is trained on the initial set of features, and the importance of each feature is determined either from the coef_ attribute or from the feature_importances_ attribute. The least important features are then removed from the dataset, and the procedure is repeated recursively on the reduced set until the desired number of features is reached.

from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
# X_train, y_train are the input and output data from the training sample
lr = LogisticRegression(class_weight='balanced', solver='lbfgs',
                        random_state=42, n_jobs=-1, max_iter=500)
rfe = RFE(lr, n_features_to_select=7)
rfe.fit(X_train, y_train)
y_pred = rfe.predict(X_train)
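
To see which features survived the elimination, the fitted selector's support_ and ranking_ attributes can be inspected (assuming X_train is a pandas DataFrame):

# Boolean mask of retained features and the elimination ranking (1 = selected)
selected_features = X_train.columns[rfe.support_]
print("Selected features:", list(selected_features))
print("Feature ranking:", rfe.ranking_)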

3. Embedded methods

These methods combine the advantages of the first two groups while reducing computational overhead. The interesting point about embedded methods is that feature selection happens during model training itself, as part of each iteration.

LASSO regularization (L1)

Regularization consists in adding a “penalty” on model parameters in order to avoid overfitting. When regularizing a linear model, the penalty is applied to the coefficients that multiply each of the predictors. LASSO regularization can shrink some coefficients exactly to zero, so the corresponding features can simply be removed from the model.

from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
# Set our regularisation parameter C=1
# Where X, y are the input and output data
logistic = LogisticRegression(C=1, penalty="l1", solver='liblinear',
                              random_state=7).fit(X, y)
model = SelectFromModel(logistic, prefit=True)
X_new = model.transform(X)
# Keep only the columns with non-zero variance
# Where "selected_features" is a DataFrame of the features pre-selected by the previous methods
selected_columns = selected_features.columns[selected_features.var() != 0]

Random Forest Importance method

The tree-based strategies used by random forests naturally rank features by how much they improve node purity. Thus, by “pruning” away features whose importance falls below a certain threshold, we can pick out the most important ones.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
# Create a random forest with your hyperparameters
model = RandomForestClassifier(n_estimators=340)
# Train the model on your sample; where X, y are the input and output data
model.fit(X, y)

# Select the most important features
importances = model.feature_importances_
# Create a separate series for visualisation (assuming X is a DataFrame)
feature_importances = pd.Series(importances, index=X.columns)
feature_importances.plot(kind='barh', color='teal')
plt.show()
Conclusion

Effectively selecting the right features for a model can bring the greatest gains in performance, and it is a problem data scientists spend a great deal of their time on. Of course, without feature engineering there would be no material for feature selection in the first place.

The correct transformations depend on many factors: the type and structure of the data, its volume, and so on. We should also not forget about the resources available on our computer or in the cloud. By adopting both techniques from this article series, you will feel much more confident in the world of Data Science.
