Boosting Techniques in Machine Learning

  • date 15th February, 2020
  • by Prwatech


In this Boosting Techniques in Machine Learning tutorial, you will learn an introduction to boosting algorithms. Are you looking for the best platform that provides information about the different types of boosting algorithms? Or are you looking forward to taking an advanced Data Science Certification Course from India's leading Data Science training institute? Then you've landed on the right path.

Machine learning is vital for many of the technologies that seek to extract intelligence from data, and companies recognize its great value. Boosting is an ensemble meta-algorithm used mainly to reduce bias, and also variance, in supervised learning; it is a family of machine learning algorithms that convert weak learners into strong ones.

The tutorial below will help you understand boosting techniques in machine learning in detail, so just follow all the tutorials of India's leading Data Science training institute in Bangalore and become a pro Data Scientist or Machine Learning Engineer.

Boosting Algorithm Introduction

As seen in the introduction to ensemble methods, boosting is one of the advanced ensemble methods that improves overall performance by decreasing bias. Boosting is a sequential process, where each succeeding model attempts to correct the errors of the preceding model.

Different Types of Boosting Algorithm

There are mainly five types of boosting techniques.

AdaBoost

Gradient Boosting (GBM)

Light GBM

XGBoost

CatBoost

Let’s see more about these types.

AdaBoost Algorithm in Machine Learning

One of the simplest boosting algorithms is AdaBoost. Usually, decision trees are used for modeling. Multiple sequential models are created, and each model corrects the errors of the previous one. AdaBoost assigns weights to the observations that are predicted incorrectly, and the succeeding model works to predict those values correctly. The AdaBoost classifier combines weak classifiers to form a strong classifier. The mathematical equation for AdaBoost can be represented as follows:

F(x) = sign( Σ_{m=1}^{M} θ_m f_m(x) )

where

f_m = the mth weak classifier

θ_m = the corresponding weight.

It is exactly the weighted combination of M weak classifiers. The steps for performing the AdaBoost algorithm are as follows:

Initially, equal weights are assigned to all data points in the dataset.

A subset of data is used to build the model.

Predictions are made based on this model, for the whole dataset.

By comparing the predictions and actual values the errors are measured.

When creating the next model, the data points that were predicted incorrectly are assigned higher weights.

Weights are determined with the help of the error values. This means that if the error is high, the assigned weight is also high for the corresponding data point.

This process is repeated until the error function stops changing, or until the maximum number of estimators is reached (a minimal sketch of one such boosting round follows this list).
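To make the re-weighting steps above concrete, here is a minimal sketch of a single boosting round. It is not scikit-learn's exact implementation; it assumes labels coded as -1/+1 and a one-level decision tree (a "stump") as the weak learner.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_round(X, y, weights):
    # Fit a decision stump using the current sample weights
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=weights)
    pred = stump.predict(X)

    # Weighted error of this weak learner
    err = np.sum(weights * (pred != y)) / np.sum(weights)

    # More accurate learners receive a larger weight (alpha)
    alpha = 0.5 * np.log((1 - err) / (err + 1e-10))

    # Increase the weights of misclassified points, decrease the rest, then normalize
    new_weights = weights * np.exp(-alpha * y * pred)
    new_weights = new_weights / new_weights.sum()
    return stump, alpha, new_weights

Repeating this round and summing the alpha-weighted stump predictions gives exactly the weighted combination described above.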

Now let’s take one example of employee attrition.

Initializing and importing libraries

import pandas as pd

import numpy as np

Reading File

df=pd.read_csv("Your File Path")

df.head()

Splitting dataset into train and test

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.3, random_state=0)

x_train=train.drop('status',axis=1)

y_train=train['status']

x_test=test.drop('status',axis=1)

y_test=test['status']

Applying AdaBoost:

from sklearn.ensemble import AdaBoostClassifier

model = AdaBoostClassifier(random_state=1)

model.fit(x_train, y_train)

model.score(x_test,y_test)

Output:

0.9995555555555555

Note: In the case of regression, the steps will be the same. We just have to replace AdaBoostClassifier with AdaBoostRegressor.
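As a minimal sketch of that swap (using hypothetical numeric targets y_train_age and y_test_age, which are not part of the attrition dataset above):

from sklearn.ensemble import AdaBoostRegressor

# y_train_age / y_test_age are hypothetical numeric targets, shown only for illustration
reg = AdaBoostRegressor(random_state=1)
reg.fit(x_train, y_train_age)
reg.score(x_test, y_test_age)   # score() returns R^2 for regressors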

 

Gradient Boosting (GBM) in Machine Learning:

Gradient Boosting, or GBM, is an ensemble machine learning algorithm that works on both regression and classification problems. In GBM, a number of weak learners are combined to form a strong learner. Here too, each succeeding tree is built based on the errors calculated from the preceding trees, and regression trees are used as the base learners. Let's take one example to understand how this technique works: we have to predict the age of a person.

Gender | Height | Weight | BMI   | Physical Activity | Age
M      | 160    | 85     | 33.20 | 1                 | 35
F      | 155    | 64     | 26.64 | 0                 | 27
M      | 170.7  | 95     | 28.14 | 0                 | 28
F      | 185.4  | 65     | 23.27 | 1                 | 28
F      | 158    | 70     | 36.05 | 1                 | 32
F      | 155    | 90     | 35.38 | 0                 | 28
M      | 173.7  | 72     | 23.86 | 1                 | 22
F      | 161.5  | 74     | 23.77 | 0                 | 33

The mean age is taken as the predicted value (indicated by 'Prediction 1') for all observations in the dataset. The difference between the mean age and the actual age values is considered the error, indicated by 'Error 1'.

Prediction 1:

Gender | Height | Weight | BMI   | Physical Activity | Age | Mean Age (Prediction 1) | Error 1
M      | 160    | 85     | 33.20 | 1                 | 35  | 29                      | 6
F      | 155    | 64     | 26.64 | 0                 | 27  | 29                      | -2
M      | 170.7  | 95     | 28.14 | 0                 | 28  | 29                      | -1
F      | 185.4  | 65     | 23.27 | 1                 | 28  | 29                      | -1
F      | 158    | 70     | 36.05 | 1                 | 32  | 29                      | 3
F      | 155    | 90     | 35.38 | 0                 | 28  | 29                      | -1
M      | 173.7  | 72     | 23.86 | 1                 | 22  | 29                      | -7
F      | 161.5  | 74     | 23.77 | 0                 | 33  | 29                      | 4

Using the Error 1 values, a tree model is fitted. The purpose is to reduce that error toward 0.

Prediction 2:

Gender | Physical Activity | Age | Mean Age (Prediction 1) | Error 1 | Prediction 2 | Mean + Prediction 2
M      | 1                 | 35  | 29                      | 6       | 4            | 33
F      | 0                 | 27  | 29                      | -2      | -1           | 28
M      | 0                 | 28  | 29                      | -1      | -1           | 29
F      | 1                 | 28  | 29                      | -1      | 0            | 29
F      | 1                 | 32  | 29                      | 3       | 1            | 30
F      | 0                 | 28  | 29                      | -1      | 1            | 30
M      | 1                 | 22  | 29                      | -7      | -2           | 27
F      | 0                 | 33  | 29                      | 4       | 3            | 32

The error values are then modelled with a new set of predictions, indicated by 'Prediction 2'. If we add the previously calculated mean and Prediction 2, we should get a value of 'Age' that approaches the actual age. That means the error, i.e. the difference between the actual value and the predicted value, should decrease. Similarly, in each iteration the residuals from the current prediction are used to fit the next stage. This process is repeated until the maximum number of iterations is reached.
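To make this residual-fitting loop concrete, here is a minimal, self-contained sketch (it assumes squared-error loss, so each new tree is simply fit to the current residuals; it is not the exact scikit-learn implementation):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def simple_gbm(X, y, n_rounds=10, learning_rate=0.1):
    # Prediction 1: start from the mean of the target (29 in the age example above)
    prediction = np.full(len(y), float(y.mean()))
    trees = []
    for _ in range(n_rounds):
        residuals = y - prediction                      # error of the current prediction
        tree = DecisionTreeRegressor(max_depth=2)
        tree.fit(X, residuals)                          # fit the next tree to the residuals
        prediction += learning_rate * tree.predict(X)   # move the prediction toward the actual age
        trees.append(tree)
    return trees, prediction

Now let's see an example.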

Example:

We will take the same example as in the AdaBoost technique. We will apply GBM to the training and test sets of the dataset.

from sklearn.ensemble import GradientBoostingClassifier

model= GradientBoostingClassifier(learning_rate=0.01,random_state=1)

model.fit(x_train, y_train)

model.score(x_test,y_test)

Output:

accuracy_score on test dataset :  0.7595555555555555

Note: In the case of regression, the steps will be the same. We just have to replace GradientBoostingClassifier with GradientBoostingRegressor.

Light GBM Machine Learning:

Light GBM is most useful when the dataset is extremely large. It is faster at handling huge amounts of data compared to the other algorithms. Unlike other algorithms that grow trees level-wise, it uses a tree-based algorithm that follows a leaf-wise approach. Leaf-wise growth may cause over-fitting on smaller datasets, which can be handled by a parameter named 'max_depth'.
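As an illustrative sketch (values are not tuned for any particular dataset), the leaf-wise growth can be constrained through parameters such as 'num_leaves' and 'max_depth' in the Light GBM parameter dictionary:

# Illustrative parameter values only, not tuned
params = {
    'objective': 'binary',    # binary classification task
    'learning_rate': 0.05,
    'num_leaves': 31,         # maximum leaves per tree; the main leaf-wise control
    'max_depth': 7,           # caps tree depth to curb over-fitting on small data
}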

We will train Light GBM on the training set, and the model will predict on the test dataset.

First, install the library

pip install lightgbm

Applying the algorithm to data sets

import lightgbm as lgb

from sklearn.metrics import accuracy_score

train_data=lgb.Dataset(x_train,label=y_train)

#defining parameters
params = {'learning_rate':0.001, 'objective':'binary'}   # binary classification objective

model= lgb.train(params, train_data, 100)

y_pred=model.predict(x_test)

# Convert the predicted probabilities to class labels
for i in range(0,4500):    # 4500 is the number of data points in y_pred
    if y_pred[i]>=0.5:
        y_pred[i]=1
    else:
        y_pred[i]=0

# Accuracy score on the test dataset
acc_test = accuracy_score(y_test,y_pred)

print('\naccuracy_score on test dataset : ', acc_test)

Output:

0.958

 

XGBoost in Machine Learning:

XGBoost stands for eXtreme Gradient Boosting. It is one of the advanced implementations of the gradient boosting algorithm. XGBoost is often reported to be nearly 10 times faster than other gradient boosting implementations, and it has high predictive power. It also helps in reducing over-fitting and improves the overall accuracy of the model, which is why it is also called a 'regularized boosting' technique. We will apply the XGBoost technique on the same dataset.

First, install the library

pip install xgboost

Applying XGBoost on the same dataset.

import xgboost as xgb

model=xgb.XGBClassifier(random_state=1,learning_rate=0.01)

model.fit(x_train, y_train)

model.score(x_test,y_test)

Output:

0.9995555555555555
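Since XGBoost is described as a 'regularized boosting' technique, its classifier also exposes explicit regularization parameters. A hedged sketch with illustrative (untuned) values:

import xgboost as xgb

# max_depth, reg_alpha and reg_lambda help reduce over-fitting; the values below are illustrative
model = xgb.XGBClassifier(
    random_state=1,
    learning_rate=0.01,
    max_depth=4,       # limit tree depth
    reg_alpha=0.1,     # L1 regularization on leaf weights
    reg_lambda=1.0,    # L2 regularization on leaf weights
)
model.fit(x_train, y_train)
model.score(x_test, y_test)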

CatBoost in Machine Learning:

Handling large amounts of labelled data is difficult when there are many categorical values: if the data has many categorical variables, the label-encoding step becomes hard to manage. In that case, CatBoost can handle categorical variables directly and does not require the extensive data preprocessing that other ML models need. Let's apply the technique to the same dataset.

Install the library

pip install catboost

Applying CatBoost

from catboost import CatBoostClassifier

model=CatBoostClassifier()

# Indices of the non-float (i.e. categorical) columns in the dataframe
categorical_features_indices = np.where(df.dtypes != np.float64)[0]

# The indices of the categorical columns of x_train are passed as cat_features
model.fit(x_train,y_train,cat_features=([0,1,2,3,4]),eval_set=(x_test, y_test))

model.score(x_test,y_test)

Output:

0.9991111111111111
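As a follow-up sketch, the categorical column indices can also be derived from the dataframe dtypes instead of being hard-coded, and the data can be wrapped in a CatBoost Pool (this assumes the categorical columns have an object dtype; it is illustrative, not the only way):

from catboost import CatBoostClassifier, Pool

# Columns with object dtype are treated as categorical (an assumption for this sketch)
cat_idx = np.where(x_train.dtypes == object)[0]

train_pool = Pool(x_train, y_train, cat_features=cat_idx)
test_pool = Pool(x_test, y_test, cat_features=cat_idx)

model = CatBoostClassifier(verbose=0)
model.fit(train_pool, eval_set=test_pool)
model.score(x_test, y_test)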

We hope you now understand the boosting techniques in machine learning. Get success in your career as a Data Scientist by being a part of Prwatech, India's leading Data Science training institute in Bangalore.
