Solving machine learning problems with a gradient boosting algorithm

Gradient Boosting is one of the most effective tools for solving machine learning problems, especially in Kaggle competitions. To learn how to apply it correctly, let’s take a closer look at the underlying processes.

Gradient Boosting is an advanced machine learning algorithm for solving classification and regression problems. It builds a prediction in the form of an ensemble of weak predictive models, typically shallow decision trees. From several weak models, we assemble a single strong one. The general idea of the algorithm is to add models sequentially, so that each successive model reduces the error left by the ones before it.

Suppose you are playing golf. To get the ball into the hole, you adjust each swing based on the result of the previous shot. That is, before a new shot, the golfer first looks at the distance between the ball and the hole after the previous shot, because the main task is to reduce this distance with the next shot.

Boosting is constructed in the same way. First, we need to define the “hole”, namely the goal, which is the end result of our efforts. Next, we need to understand where to aim the club in order to land closer to the target. With these rules in mind, we need to build the right sequence of actions so that each successive shot decreases the distance between the ball and the hole.
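The sequence of shots can be sketched in a few lines of code. This is a minimal illustration for squared-error regression, not the implementation used by any library: the toy data, the learning rate of 0.1 and the depth-2 trees are all assumptions chosen for the example. Each round fits a small tree to the current residuals (the distance still left to the hole) and adds a fraction of its prediction to the ensemble.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data (assumed for illustration): a noisy sine curve
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_rounds = 50

# First "shot": a constant prediction (the mean minimizes squared error)
prediction = np.full_like(y, y.mean())
errors = [np.mean((y - prediction) ** 2)]
trees = []

for _ in range(n_rounds):
    residual = y - prediction                 # distance still left to the "hole"
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residual)                     # the next weak model learns the remaining error
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    errors.append(np.mean((y - prediction) ** 2))

print("MSE after the first shot: %.4f" % errors[0])
print("MSE after %d rounds: %.4f" % (n_rounds, errors[-1]))
```

The training error shrinks from round to round, which is exactly the "each shot lands closer" behaviour described above. On unseen data the error can start to grow again after enough rounds, which is why the number of rounds has to be tuned.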

It is worth noting that the implementation of the algorithm differs for classification and regression problems, mainly in the loss function that is minimized.
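One way to see the difference is to look at what each new tree is trained on. The sketch below assumes two common choices, squared error for regression and log loss for binary classification; the function names are made up for this example. In both cases the tree is fit to the negative gradient of the loss with respect to the current raw score.

```python
import numpy as np

def negative_gradient_squared_error(y, raw_score):
    # Regression with loss 0.5 * (y - F)^2: the negative gradient is the residual
    return y - raw_score

def negative_gradient_log_loss(y, raw_score):
    # Binary classification with log loss: F is a log-odds score,
    # and the negative gradient is the label minus the predicted probability
    probability = 1.0 / (1.0 + np.exp(-raw_score))
    return y - probability

# Regression: targets and current predictions
y_reg = np.array([3.0, -1.0, 0.5])
f_reg = np.array([2.0, 0.0, 0.5])
reg_targets = negative_gradient_squared_error(y_reg, f_reg)
print(reg_targets)

# Classification: 0/1 labels and current log-odds scores
y_clf = np.array([1.0, 0.0, 1.0])
f_clf = np.array([0.0, 0.0, 2.0])
clf_targets = negative_gradient_log_loss(y_clf, f_clf)
print(clf_targets)
```

The boosting loop itself stays the same; only the quantity the trees are fit to, and the way raw scores are turned into final predictions, changes.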

Algorithm parameters

loss – loss function to minimize.
criterion – function used to measure the quality of a split, for example Mean Squared Error (MSE) or Mean Absolute Error (MAE). Used only when building trees. The accepted values depend on the scikit-learn version.
init – estimator used to compute the initial predictions. This is the starting point that boosting then improves.
learning_rate – shrinks the contribution of each tree. Lower values usually require more iterations (n_estimators).
n_estimators – number of boosting iterations. More iterations usually improve quality, but too many of them slow down training and prediction and can lead to overfitting.
min_samples_split – minimum number of samples required to split a node. Raising this parameter helps to avoid overfitting.
min_samples_leaf – minimum number of samples in a leaf (terminal node). Increasing this parameter makes the trees simpler: the model is built faster, but its quality may decrease. Smaller values should be chosen for imbalanced samples.
max_depth – maximum depth of each tree. Limiting it helps to avoid overfitting.
max_features – number of features considered when looking for the best split in the tree.
max_leaf_nodes – maximum number of leaf nodes in the tree. Older scikit-learn releases documented that max_depth is ignored when this parameter is set; check the documentation of your installed version.
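To make the interplay between n_estimators and overfitting more concrete, here is a small sketch that tracks the validation error after every boosting iteration. The synthetic dataset and all hyperparameter values are assumptions picked for illustration. It relies on staged_predict, which yields the ensemble prediction after each added tree.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (assumed for illustration)
X, y = make_regression(n_samples=500, n_features=10, n_informative=5,
                       noise=10.0, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=1)

model = GradientBoostingRegressor(n_estimators=300, learning_rate=0.1,
                                  max_depth=3, min_samples_leaf=5, random_state=1)
model.fit(X_train, y_train)

# Validation error of the ensemble after 1, 2, ..., 300 trees
val_errors = [mean_squared_error(y_val, y_pred)
              for y_pred in model.staged_predict(X_val)]
best_n = val_errors.index(min(val_errors)) + 1

print("Best number of trees on the validation set:", best_n)
print("Validation MSE at that point: %.2f" % val_errors[best_n - 1])
```

If the best number of trees turns out to be well below the configured n_estimators, the extra iterations are not helping on unseen data and can be dropped.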

Implementation in Python (sklearn library)

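import warnings

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

warnings.filterwarnings('ignore')
breast_cancer = load_breast_cancer()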


### Features and target variable for our future model
X = pd.DataFrame(breast_cancer['data'], columns=breast_cancer['feature_names'])
y = pd.Categorical.from_codes(breast_cancer['target'], breast_cancer['target_names'])

lbl = LabelEncoder() 
lbl.fit(y)

y_enc = lbl.transform(y)


### Deal with features
scl = StandardScaler()
scl.fit(X)
X_scaled = scl.transform(X)

### Split the scaled features (the original split passed the unscaled X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_enc, test_size=0.20, random_state=42)


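### Set the parameters for our model
### Note: scikit-learn 1.0+ spells the MSE criterion 'squared_error'; older versions used 'mse'
params = {'n_estimators': 200,
          'max_depth': 12,
          'criterion': 'squared_error',
          'learning_rate': 0.03,
          'min_samples_leaf': 16,
          'min_samples_split': 16
          }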


### Training
gbr = GradientBoostingRegressor(**params)
gbr.fit(X_train,y_train)


### Calculate the score (for a regressor, score() returns R², not accuracy)
train_r2 = gbr.score(X_train, y_train)
print(train_r2)

test_r2 = gbr.score(X_test, y_test)
print(test_r2)

### Prediction
y_pred = gbr.predict(X_test)

### Mean Squared Error and R² on the test set
mse = mean_squared_error(y_test, y_pred)
print("MSE: %.2f" % mse)
print(r2_score(y_test, y_pred))

The result of the code:

0.9854271477118486
0.8728770740774442
MSE: 0.03
0.8728770740774442

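A basic gradient boosting model with only light tuning gives us an R² of about 0.87 on the test set (and about 0.99 on the training set). Note that score() and r2_score report R² for a regressor, not classification accuracy.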
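The target in this dataset is a binary label, so the classification variant of the algorithm is the more natural fit, and accuracy then becomes a meaningful metric. The following is a minimal sketch using GradientBoostingClassifier. The hyperparameters are illustrative assumptions and are not taken from the regressor example above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)

# Illustrative settings: shallow trees and a small learning rate
clf = GradientBoostingClassifier(n_estimators=200, learning_rate=0.03,
                                 max_depth=3, random_state=42)
clf.fit(X_train, y_train)

# For a classifier, predictions are class labels and accuracy applies directly
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Test accuracy: %.3f" % accuracy)
```

The exact number depends on the split and on the library version, so treat it as a sanity check, not a benchmark.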

What libraries to use?

In addition to the classic sklearn, three libraries are most commonly used for gradient boosting:

XGBoost is a more regularised form of gradient boosting. The main advantage of this library is performance and efficient optimization of computation (better results with fewer resources).

You can install XGBoost as follows:

pip install xgboost

The XGBoost library provides us with different classes for different tasks: XGBClassifier for classification and XGBRegressor for regression.

Note: All of the libraries below have separate classes for both classification and regression tasks.

Example of using XGBoost for classification:

# xgboost for classification problem
from numpy import asarray
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
# define the dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# and our model with cross-validation
model = XGBClassifier()
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# train the model on the whole dataset
model = XGBClassifier()
model.fit(X, y)
# predict
row = [2.56999479, -0.13019997, 3.16075093, -4.35936352, -1.61271951,
 -1.39352057, -2.48924933, -1.93094078, 3.26130366, 2.05692145]
row = asarray(row).reshape((1, len(row)))
yhat = model.predict(row)
print('Predict: %d' % yhat[0])

Example of using XGBoost for regression:

# xgboost for regression
from numpy import asarray
from numpy import mean
from numpy import std
from sklearn.datasets import make_regression
from xgboost import XGBRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
# define the dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# and our model (here we change the metric to MAE)
model = XGBRegressor(objective='reg:squarederror')
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# scikit-learn reports MAE with a negative sign, so that higher is still better
n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# train the model on the whole data set
model = XGBRegressor(objective='reg:squarederror')
model.fit(X, y)
# predict
row = [[2.56999479, -0.13019997, 3.16075093, -4.35936352, -1.61271951, -1.39352057, -2.48924933, -1.93094078, 3.26130366, 2.05692145]]
row = asarray(row)
yhat = model.predict(row)
print('Prediction: %.3f' % yhat[0])

LightGBM is a library from Microsoft. It adds a form of automatic feature selection and focuses boosting on the training examples with larger gradients. This contributes to faster training of the model and better prediction performance. The library is especially popular in Kaggle competitions on tabular data.
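The idea of focusing on examples with larger gradients can be illustrated with a short sampling routine. This is only a sketch of the concept known as gradient-based one-side sampling, not LightGBM's actual code; the function name goss_sample and the rates 0.2 and 0.1 are assumptions made for the example. Examples with the largest gradients are always kept, while the rest are subsampled and up-weighted so that their total contribution stays roughly the same.

```python
import numpy as np

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) of the examples kept for building the next tree."""
    rng = np.random.default_rng(seed)
    n = len(gradients)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    order = np.argsort(-np.abs(gradients))    # largest |gradient| first
    top_idx = order[:n_top]                   # poorly fitted examples are always kept
    other_idx = rng.choice(order[n_top:], size=n_other, replace=False)

    indices = np.concatenate([top_idx, other_idx])
    weights = np.ones(len(indices))
    # Up-weight the sampled small-gradient examples to compensate for the sampling
    weights[n_top:] = (1.0 - top_rate) / other_rate
    return indices, weights

# Hypothetical gradients for 1000 training examples
gradients = np.random.default_rng(1).normal(size=1000)
indices, weights = goss_sample(gradients)
print("kept", len(indices), "of", len(gradients), "examples")
```

The next tree would then be trained on roughly a third of the data, which is where the speed-up comes from.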

You can also install LightGBM using pip:

pip install lightgbm

LightGBM for classification:

# lightgbm for classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
# define the dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# and our model
model = LGBMClassifier()
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

LightGBM for regression:

# lightgbm for regression
from numpy import mean
from numpy import std
from sklearn.datasets import make_regression
from lightgbm import LGBMRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
# define the dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# and our model (here we change the metric to MAE)
model = LGBMRegressor()
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# scikit-learn reports MAE with a negative sign, so that higher is still better
n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# train the model on the whole data set
model = LGBMRegressor()
model.fit(X, y)
# predict
row = [[2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]]
yhat = model.predict(row)
print('Prediction: %.3f' % yhat[0])

CatBoost is a gradient boosting library created by Yandex. It uses oblivious decision trees, which are balanced (symmetric) trees: the same feature and split condition are used for every node at a given level of the tree.
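Because every level of such a tree asks the same question of all rows, the path of a row through the tree can be encoded as a sequence of bits, and the leaf is simply the number those bits spell out. The sketch below illustrates this with made-up features, thresholds and leaf values; it is a conceptual example and does not reflect CatBoost's internal data structures.

```python
import numpy as np

def oblivious_tree_predict(X, features, thresholds, leaf_values):
    """Predict with one oblivious tree: level k tests X[:, features[k]] > thresholds[k]."""
    leaf_index = np.zeros(len(X), dtype=int)
    for feature, threshold in zip(features, thresholds):
        bit = (X[:, feature] > threshold).astype(int)
        leaf_index = leaf_index * 2 + bit     # append one bit per tree level
    return leaf_values[leaf_index]

# Hypothetical tree of depth 2: level 0 tests feature 0, level 1 tests feature 1
features = [0, 1]
thresholds = [0.5, 3.0]
leaf_values = np.array([10.0, 20.0, 30.0, 40.0])   # one value per leaf, 2**depth leaves

X = np.array([[0.2, 5.0],
              [0.9, 5.0],
              [0.2, 1.0],
              [0.9, 1.0]])
predictions = oblivious_tree_predict(X, features, thresholds, leaf_values)
print(predictions)
```

A tree of depth d is fully described by d (feature, threshold) pairs and 2**d leaf values, which makes prediction a few vectorized comparisons.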

Moreover, the main advantage of CatBoost (besides faster computation) is its support for categorical input variables. This is where the library gets its name: CatBoost stands for “Category Gradient Boosting”. Not the cats.

You can install CatBoost the same way as the previous libraries:

pip install catboost

CatBoost in the classification task:

# catboost for classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from catboost import CatBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
# define the dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# evaluate the model
model = CatBoostClassifier(verbose=0, n_estimators=100)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# train the model on the whole data set
model = CatBoostClassifier(verbose=0, n_estimators=100)
model.fit(X, y)
# predict
row = [[2.56999479, -0.13019997, 3.16075093, -4.35936352, -1.61271951, -1.39352057, -2.48924933, -1.93094078, 3.26130366, 2.05692145]]
yhat = model.predict(row)
print('Prediction: %d' % yhat[0])

CatBoost in a regression problem:

# catboost for regression
from numpy import mean
from numpy import std
from sklearn.datasets import make_regression
from catboost import CatBoostRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
# define the dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# and our model (here we change the metric to MAE)
model = CatBoostRegressor(verbose=0, n_estimators=100)
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# train the model on the whole data set
model = CatBoostRegressor(verbose=0, n_estimators=100)
model.fit(X, y)
# predict
row = [[2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]]
yhat = model.predict(row)
print('Prediction: %.3f' % yhat[0])

When to use?

You can use the gradient boosting algorithm under the following conditions:

The training data contains a large number of observations.
The number of features is smaller than the number of observations in the training data. Boosting works well when the data contain a mixture of numerical and categorical features or only numerical features.
Predictive performance, as measured by model metrics, is the main priority.

When XGBoost should NOT be used:

In image recognition and computer vision tasks.
In Natural Language Processing (NLP).
When the number of training samples is much smaller than the number of features.

Pros and cons

Pros:

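The algorithm works with any differentiable loss function.
Predictions are often better than those of other algorithms, especially on tabular data.
Many implementations (for example XGBoost, LightGBM and CatBoost) handle missing values on their own.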

Cons:

The algorithm is sensitive to outliers: it keeps trying to correct the large errors at these points and can spend a lot of its capacity on them. However, it is worth noting that using an absolute-error loss (MAE) instead of a squared-error loss (MSE) significantly reduces the impact of outliers on your model (in scikit-learn this is selected with the loss parameter).
Your model will be prone to overfitting if the number of trees is too high. This problem is common in boosted tree ensembles and can be addressed by properly tuning the n_estimators parameter together with learning_rate.
The computation can take a long time. Therefore, if you have a large dataset, choose a reasonable sample size and don’t forget to set the min_samples_leaf parameter appropriately.
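The point about outliers can be checked with simple arithmetic. In the sketch below (toy numbers, assumed for illustration) we compare the negative gradients that the next tree would be asked to fit under a squared-error loss and under an absolute-error loss, and measure how much of the total "correction signal" comes from a single outlier.

```python
import numpy as np

y_true = np.array([1.0, 1.2, 0.9, 1.1, 50.0])   # the last value is an outlier
prediction = np.full_like(y_true, 1.0)           # current ensemble prediction
residual = y_true - prediction

grad_mse = residual            # negative gradient of 0.5 * (y - F)^2
grad_mae = np.sign(residual)   # negative gradient of |y - F|

# Share of the total gradient magnitude that belongs to the outlier
share_mse = np.abs(grad_mse[-1]) / np.abs(grad_mse).sum()
share_mae = np.abs(grad_mae[-1]) / np.abs(grad_mae).sum()

print("Outlier share with squared error:  %.2f" % share_mse)
print("Outlier share with absolute error: %.2f" % share_mae)
```

With squared error the outlier dominates what the next tree learns, while with absolute error every point contributes at most a unit-sized gradient, so the outlier counts no more than any other misfit point.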

Although gradient boosting is widely used in all areas of data science, from Kaggle competitions to practical problems, many experts still use it as a black box. In this article, we broke down the method into simpler steps to help readers understand the underlying processes of the algorithm. Good luck!