How to Use and Tune XGBoost

Why Is XGBoost Awesome?

XGBoost is one of the most widely implemented and heavily used classifiers, and there are good reasons for that.

XGBoost can build both classification and regression models with the same core algorithm, and it is one of the most used algorithms on Kaggle and in day-to-day applications.

  1. First of all, it is an ensemble technique: it combines many small models and leverages their collective power.
  2. It is a regularized boosting technique.
  3. Because it exposes many parameters, it allows many different designs and is highly flexible.
  4. XGBoost can use multiple cores, building trees in parallel.
  5. The main idea of XGBoost is to build on existing models, putting extra emphasis on the observations that earlier models got wrong.
  6. It handles missing values natively and ships with built-in cross-validation, which reduces data-cleaning effort.
  7. It can also handle class imbalance problems (see the sketch after this list), so we often don't need SMOTE or another oversampling technique to get a good result or balance our models.
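
A minimal sketch of points 6 and 7, using synthetic toy data (the array names and values here are illustrative, not from the original post): XGBoost accepts NaNs directly, and scale_pos_weight re-weights the positive class to counter imbalance.

import numpy as np
import xgboost as xgb

# Hypothetical toy data: 100 rows, 3 features, an imbalanced binary target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[::10, 1] = np.nan                      # missing values are handled natively
y = (rng.random(100) < 0.2).astype(int)  # roughly 20% positive class

# Re-weight the rare class instead of oversampling
clf = xgb.XGBClassifier(scale_pos_weight=(y == 0).sum() / (y == 1).sum())
clf.fit(X, y)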

Mechanism Overview

Here is a simple overview of the mechanism behind XGBoost and why it differs from other algorithms. When we talk about ensemble methods, there are majorly two approaches: boosting and bagging. In both, we combine a bunch of weak learners/models/trees to get the output; bagging trains them independently and aggregates their votes, while boosting trains them sequentially so each new learner corrects the ones before it. This technique provides in-built regularization, and combining more trees/learners leaves less chance of running into overfitting.

The main difference between XGBoost and other boosting techniques like gradient boosting is that XGBoost uses L1/L2 regularization, making it less prone to overfitting; at its core, though, it follows the same principle of minimizing the residuals and giving more weight to the observations that deviated the most from the accurate values. For more detail, it is worth getting into the math of gradient boosting and walking through a simplified example. You can find the dataset here at Classification Dataset.
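
To make the residual-fitting idea concrete, here is a minimal hand-rolled sketch (plain gradient boosting on a toy dataset, not XGBoost itself): each new shallow tree is fit to the residuals of the running prediction, and its output is blended in with a small learning rate.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression data
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())   # start from the mean prediction
trees = []

for _ in range(100):
    residuals = y - prediction                     # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)  # nudge predictions toward the target
    trees.append(tree)                             # keep the ensemble for later use

print("Training MSE after boosting: %.4f" % np.mean((y - prediction) ** 2))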

Reading in the Libraries

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
#prediction and classification report
from sklearn.metrics import classification_report
#let's use accuracy score
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

#Loading in the data
data = pd.read_csv('Classification Data.csv')
data.head()

Building the Training Dataset and Preparing the Output

data_full = data.copy()
X_data = data_full.drop('Outcome', axis=1)
y = data_full.Outcome

#Split the dataset into train and test
seed = 7
test_size = 0.3
X_train, X_test, y_train, y_test = train_test_split(X_data, y, test_size=test_size, random_state=seed)

# make an empty list to track model performance
model_performances=[]

Default and Randomly Tuned XGBoost

#Train two XGBoost models for classification: one with defaults, one hand-tuned
model1 = xgb.XGBClassifier()
model2 = xgb.XGBClassifier(n_estimators=100, max_depth=8, learning_rate=0.1, subsample=0.5)
train_model1 = model1.fit(X_train, y_train)
train_model2 = model2.fit(X_train, y_train)
pred1 = train_model1.predict(X_test)
pred2 = train_model2.predict(X_test)

Evaluating the Outputs for Accuracy

#print out the accuracy scores
print("Accuracy for default XGBoost Model: %.2f" % (accuracy_score(y_test, pred1) * 100))
print("Accuracy for Randomly tuned Model: %.2f" % (accuracy_score(y_test, pred2) * 100))
model_performances.append(['Default XGBoost',accuracy_score(y_test, pred1) * 100])
model_performances.append(['Randomly Tuned XGBoost',accuracy_score(y_test, pred2) * 100])

Hyperparameters and Tuning Models

Hyperparameters are the settings that can be tuned so a model, much like a radio, resonates with the frequency of the data: the more apt your tuning, the better the clarity of the music, or in this case, the clearer the data patterns our model learns. They are the parameters that determine the learning path of our model's algorithm.

The most distinctive thing about XGBoost is that it has many hyperparameters and therefore provides a great degree of flexibility, but at the same time it becomes important to tune them to get the most out of the data, something that is far less necessary with simpler models.

Types of XGBoost Parameters

1. General Parameters: booster, verbosity, and nthread
2. Booster Parameters (tree booster and linear booster): eta, gamma, max_depth, min_child_weight, max_delta_step, subsample, colsample_bytree, colsample_bylevel, colsample_bynode, lambda, alpha, tree_method, scale_pos_weight, max_leaves
3. Learning Task Parameters: objective, eval_metric, seed


General Parameters

1. booster – default value (gbtree). It has three options: gbtree (tree-based), dart (tree-based), and gblinear (linear model).
2. verbosity – controls how much the model tells you about itself; set it to 1 or higher to activate the messages.
3. nthread – defaults to the maximum available; the number of threads (cores) the developer wishes to use while running the model.
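
A minimal sketch of setting the general parameters (reusing the xgb import from above; the values are arbitrary illustrations, not recommendations):

# General parameters on the scikit-learn wrapper
model = xgb.XGBClassifier(
    booster='gbtree',  # tree-based boosting; alternatives are 'dart' and 'gblinear'
    verbosity=1,       # print warnings while training
    nthread=4          # use four cores
)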

There is a whole host of parameters that you can use to tune this model.

https://xgboost.readthedocs.io/en/stable/parameter.html

Using GridSearch with XGBoost

Before we hand things over to the grid search, let's see how the results stack up with one more manually tuned model.

# A manually tuned starting point for the grid search
# For our use case we have picked some of the important parameters; a deeper approach would be to search over every parameter and value
model3 = xgb.XGBClassifier(
 learning_rate =0.1,
 n_estimators=1000,
 max_depth=5,
 min_child_weight=1,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1,
 seed=27)
#  build the training set and model application 
train_model3 = model3.fit(X_train, y_train)
pred3 = train_model3.predict(X_test)
print("Accuracy for Randomly tuned model: %.2f" % (accuracy_score(y_test, pred3) * 100))
model_performances.append(['Second Randomly Tuned XGBoost',accuracy_score(y_test, pred3) * 100])

Fully Tuned Model with GridSearchCV

param_test = {
    'max_depth': [4, 5, 6, 7, 8, 9, 10],
    'min_child_weight': [2, 4, 5, 6],
    'learning_rate': [0.1, 0.01, 0.05],
    'n_estimators': [100, 10, 1000],
    'objective': ['binary:logistic', 'binary:hinge', 'binary:logitraw']
}
gsearch = GridSearchCV(
    estimator=xgb.XGBClassifier(
        gamma=0, subsample=0.8, colsample_bytree=0.8,
        objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
    param_grid=param_test, scoring='roc_auc', n_jobs=4, cv=5)

train_model4 = gsearch.fit(X_train, y_train)
pred4 = train_model4.predict(X_test)
print("Accuracy for Fully Tuned Model: %.2f" % (accuracy_score(y_test, pred4) * 100))
model_performances.append(['Fully Tuned XGBoost',accuracy_score(y_test, pred4) * 100])
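
GridSearchCV also records which combination won; it is worth printing that before moving on (a small addition to the walkthrough; both attributes are standard scikit-learn):

# Inspect the winning combination and its cross-validated score
print("Best parameters:", gsearch.best_params_)
print("Best CV ROC AUC: %.4f" % gsearch.best_score_)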

Simple Logistic Model vs XGBoost

clf = LogisticRegression(random_state=0).fit(X_train, y_train)
pred_log = clf.predict(X_test)
print(pd.DataFrame(classification_report(y_test, pred_log, output_dict=True)))
print("Accuracy for Logistic Model: %.2f" % (accuracy_score(y_test, pred_log) * 100))
model_performances.append(['Logistic Regression',accuracy_score(y_test, pred_log) * 100])

Let's Compare All the Models' Performance

# create a dataframe of the recorded scores
model_performances = pd.DataFrame(model_performances, columns=['Model', 'Accuracy Score'])
# visualize the results
import seaborn as sns
sns.color_palette("mako", as_cmap=True)
ax = sns.barplot(data=model_performances, x="Accuracy Score", y="Model")

Variable Importance

Decoding the Variable Importance in XGBoost

Variable importance tells us which features matter most to our model. It is based on three criteria used behind the scenes; there may be more, but the most important ones are Gain, Cover, and Frequency.

Gain – the relative contribution of each feature; a higher gain means a higher ability of a variable to split the data into more homogeneous branches.

Cover – the number of observations that get impacted by a particular feature across the splits where it is used.

Frequency – the number of times a particular variable is picked during model development across the different trees; so if we have 3 trees and a variable is picked in 2 of them, we could say it has a frequency of 66%, and so on.

These three combine to generate a variable importance score.
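
The scikit-learn wrapper's feature_importances_ (used below) reports a single criterion, gain by default in recent versions; the underlying booster can report each criterion directly. Note that the XGBoost API calls frequency "weight":

# Pull each importance criterion from the underlying booster
booster = model3.get_booster()
for importance_type in ['gain', 'cover', 'weight']:  # 'weight' is frequency
    print(importance_type, booster.get_score(importance_type=importance_type))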

model_importance=pd.DataFrame(model3.feature_importances_)
model_importance.columns=['Importance']
model_importance['Columns']=X_train.columns
sns.color_palette("mako", as_cmap=True)
ax=sns.barplot(data=model_importance, x="Importance", y="Columns")

Gaelim Holland
