Introduction
What is boosting?
Boosting is an ensemble method that converts weak learners into strong learners. "Weak" and "strong" refer to how well a learner's predictions correlate with the actual target variable[^1]. In boosting, training samples are drawn with replacement, with higher weights given to observations that earlier learners predicted incorrectly, and each sample is used to train one decision tree. Each tree learns from its predecessors and updates the residual errors.
Learning Objectives
The goal of this article is to help you:
- Understand the concept of boosting
- Compare boosting and bagging method
- Understand how AdaBoost algorithm works
- Understand how XGBoost algorithm works
- Implement AdaBoost and XGBoost on a business case
Library and setup
library(tidyverse)
library(rsample)
library(xgboost)
library(ggthemes)
library(tictoc)
library(fastAdaboost)
library(tidymodels)
library(inspectdf)
library(caret)
theme_set(theme_pander())
Bagging vs Boosting
The idea of bagging is to create many subsets of the training data by sampling with replacement, where every observation has the same probability of being picked. Each sample is used to train one decision tree, and the final prediction is the average of all the trees' predictions. In boosting, the samples are also drawn with replacement, but the data are weighted so that observations the previous trees got wrong are more likely to be picked; the trees learn from their predecessors and update the residual errors. After these weak learners are trained, a weighted average of their estimates is taken for the final prediction[^2].
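To make the difference concrete, here is a minimal base-R sketch (using a hypothetical set of 10 observations, not the data used later): bagging samples every observation with equal probability, while boosting re-weights the observations that the previous learner got wrong before drawing the next sample.
set.seed(100)
n      <- 10
obs_id <- 1:n
# Bagging: every observation has the same chance of being picked (with replacement)
bag_sample <- sample(obs_id, size = n, replace = TRUE)
# Boosting (conceptually): observations misclassified by the previous weak learner
# get a larger weight, so they are more likely to appear in the next sample
w <- rep(1 / n, n)
misclassified <- c(2, 5, 9)          # hypothetical errors from the previous learner
w[misclassified] <- w[misclassified] * 3
w <- w / sum(w)                      # re-normalize the weights
boost_sample <- sample(obs_id, size = n, replace = TRUE, prob = w)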
Boosting Method
Boosting algorithms differ mainly in how they create the weak learners during the iterative process:
AdaBoost
Adaptive boosting (AdaBoost) was formulated by Yoav Freund and Robert Schapire. AdaBoost was the first practical boosting algorithm, and it remains one of the most widely used and studied, with applications in numerous fields. The algorithm works by changing the sample distribution, modifying the weights of the data points at each iteration.
How Does AdaBoost Work?
We can split the idea of AdaBoost into three big concepts:
1. Use Stumps as Weak Learners
A weak learner is any model whose accuracy is better than random guessing, even if only slightly better (e.g. 0.51). Ensemble methods combine multiple weak learners to build a strong model. In AdaBoost, the weak learners are 1-level decision trees (stumps). The main idea when creating a weak classifier is to find the best stump, the one that separates the data with the lowest overall error.
2. Each Stump Influences the Next
Unlike bagging, which builds its models in parallel, boosting trains sequentially, which means each stump (weak learner) is affected by the previous stump. A stump affects the next stump by assigning different weights to the data used in the next stump-making process. This weighting is based on the error calculations: if an observation is predicted incorrectly by the first stump, it is given a greater weight in the next stump-making process.
3. Weighted Vote
In AdaBoost, each stump has a different weight, based on the error rate it produces: the smaller the error of a stump, the greater its weight. The stump weights are used in the voting process; the class with the greatest total weight becomes the final prediction. A condensed sketch of these three ideas is shown below.
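To tie the three concepts together, here is an illustrative sketch of AdaBoost on a hypothetical toy dataset. It assumes the rpart package is installed for building stumps; this is not the fastAdaboost implementation used later in this article, just a toy version of the same idea.
library(rpart)
set.seed(100)
toy <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
toy$y <- factor(ifelse(toy$x1 + toy$x2 + rnorm(200, sd = 0.5) > 0, 1, 0))
n <- nrow(toy)
w <- rep(1 / n, n)                 # start with equal observation weights
stumps <- list()
alphas <- c()
for (m in 1:5) {
  # 1. weak learner: a 1-level decision tree (stump) trained on weighted data
  stump <- rpart(y ~ ., data = toy, weights = w,
                 control = rpart.control(maxdepth = 1, cp = -1, minsplit = 2))
  pred  <- predict(stump, toy, type = "class")
  # 2. influence the next stump: up-weight the misclassified observations
  err   <- sum(w * (pred != toy$y)) / sum(w)
  alpha <- 0.5 * log((1 - err) / err)     # stump weight: smaller error -> larger weight
  w     <- w * exp(alpha * (pred != toy$y))
  w     <- w / sum(w)
  stumps[[m]] <- stump
  alphas[m]   <- alpha
}
# 3. weighted vote: each stump votes with weight alpha, the class with the
#    largest total weight wins
votes <- sapply(seq_along(stumps), function(m) {
  ifelse(predict(stumps[[m]], toy, type = "class") == "1", alphas[m], -alphas[m])
})
final_class <- ifelse(rowSums(votes) > 0, "1", "0")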
Case Example using AdaBoost
Hotels are among the most common types of lodging used when traveling. With limited capacity, a canceled reservation can be costly for the hotel provider. In this case, we will predict hotel cancellations using the Hotel Reservation Requests data taken from Kaggle.
booking <- read.csv("data_input/xgboost/hotel_bookings.csv", stringsAsFactors = T)
head(booking)
#> hotel is_canceled lead_time arrival_date_year arrival_date_month
#> 1 Resort Hotel 0 342 2015 July
#> 2 Resort Hotel 0 737 2015 July
#> 3 Resort Hotel 0 7 2015 July
#> 4 Resort Hotel 0 13 2015 July
#> 5 Resort Hotel 0 14 2015 July
#> 6 Resort Hotel 0 14 2015 July
#> arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
#> 1 27 1 0
#> 2 27 1 0
#> 3 27 1 0
#> 4 27 1 0
#> 5 27 1 0
#> 6 27 1 0
#> stays_in_week_nights adults children babies meal country market_segment
#> 1 0 2 0 0 BB PRT Direct
#> 2 0 2 0 0 BB PRT Direct
#> 3 1 1 0 0 BB GBR Direct
#> 4 1 1 0 0 BB GBR Corporate
#> 5 2 2 0 0 BB GBR Online TA
#> 6 2 2 0 0 BB GBR Online TA
#> distribution_channel is_repeated_guest previous_cancellations
#> 1 Direct 0 0
#> 2 Direct 0 0
#> 3 Direct 0 0
#> 4 Corporate 0 0
#> 5 TA/TO 0 0
#> 6 TA/TO 0 0
#> previous_bookings_not_canceled reserved_room_type assigned_room_type
#> 1 0 C C
#> 2 0 C C
#> 3 0 A C
#> 4 0 A A
#> 5 0 A A
#> 6 0 A A
#> booking_changes deposit_type agent company days_in_waiting_list customer_type
#> 1 3 No Deposit NULL NULL 0 Transient
#> 2 4 No Deposit NULL NULL 0 Transient
#> 3 0 No Deposit NULL NULL 0 Transient
#> 4 0 No Deposit 304 NULL 0 Transient
#> 5 0 No Deposit 240 NULL 0 Transient
#> 6 0 No Deposit 240 NULL 0 Transient
#> adr required_car_parking_spaces total_of_special_requests reservation_status
#> 1 0 0 0 Check-Out
#> 2 0 0 0 Check-Out
#> 3 75 0 0 Check-Out
#> 4 75 0 0 Check-Out
#> 5 98 0 1 Check-Out
#> 6 98 0 1 Check-Out
#> reservation_status_date
#> 1 2015-07-01
#> 2 2015-07-01
#> 3 2015-07-02
#> 4 2015-07-02
#> 5 2015-07-03
#> 6 2015-07-03
The data contains 119390 observations and 32 variables. Here is a description of each feature:
- hotel : Hotel (H1 = Resort Hotel or H2 = City Hotel)
- is_canceled : Value indicating if the booking was canceled (1) or not (0)
- lead_time : Number of days that elapsed between the entering date of the booking into the PMS and the arrival date
- arrival_date_year : Year of arrival date
- arrival_date_month : Month of arrival date
- arrival_date_week_number : Week number of the year for arrival date
- arrival_date_day_of_month : Day of arrival date
- stays_in_weekend_nights : Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
- stays_in_week_nights : Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
- adults : Number of adults
- children : Number of children
- babies : Number of babies
- meal : Type of meal booked. Categories are presented in standard hospitality meal packages:
  - Undefined/SC : no meal package;
  - BB : Bed & Breakfast;
  - HB : Half board (breakfast and one other meal, usually dinner);
  - FB : Full board (breakfast, lunch, and dinner)
- country : Country of origin. Categories are represented in the ISO 3155-3:2013 format
- market_segment : Market segment designation. In the categories, the term "TA" means "Travel Agents" and "TO" means "Tour Operators"
- distribution_channel : Booking distribution channel. The term "TA" means "Travel Agents" and "TO" means "Tour Operators"
- is_repeated_guest : Value indicating if the booking name was from a repeated guest (1) or not (0)
- previous_cancellations : Number of previous bookings that were cancelled by the customer prior to the current booking
- previous_bookings_not_canceled : Number of previous bookings not cancelled by the customer prior to the current booking
- reserved_room_type : Code of room type reserved. Code is presented instead of designation for anonymity reasons
- assigned_room_type : Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons
- booking_changes : Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation
- deposit_type : Indication on whether the customer made a deposit to guarantee the booking. This variable can assume three categories:
  - No Deposit – no deposit was made;
  - Non Refund – a deposit was made in the value of the total stay cost;
  - Refundable – a deposit was made with a value under the total cost of stay
- agent : ID of the travel agency that made the booking
- company : ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons
- days_in_waiting_list : Number of days the booking was in the waiting list before it was confirmed to the customer
- customer_type : Type of booking, assuming one of four categories:
  - Contract – when the booking has an allotment or other type of contract associated to it;
  - Group – when the booking is associated to a group;
  - Transient – when the booking is not part of a group or contract, and is not associated to another transient booking;
  - Transient-party – when the booking is transient, but is associated to at least one other transient booking
- adr : Average daily rate, defined by dividing the sum of all lodging transactions by the total number of staying nights
- required_car_parking_spaces : Number of car parking spaces required by the customer
- total_of_special_requests : Number of special requests made by the customer (e.g. twin bed or high floor)
- reservation_status : Reservation last status, assuming one of three categories:
  - Canceled – booking was canceled by the customer;
  - Check-Out – customer has checked in but already departed;
  - No-Show – customer did not check in and did inform the hotel of the reason why
- reservation_status_date : Date at which the last status was set. This variable can be used in conjunction with the reservation status to understand when the booking was canceled or when the customer checked out of the hotel.
The model will help the hotel predict whether a guest will cancel their booking or not. We will remove the variables agent and company because they have a lot of levels, and we also remove reservation_status and reservation_status_date.
booking <- booking %>%
select(-c(reservation_status_date, agent, company,
reservation_status)) %>%
mutate(is_canceled = as.factor(is_canceled))
Exploratory Data Analysis
Before we go further, we need to check whether there are any missing values in the data. We can use the inspect_na() function from the inspectdf package to check for missing values.
booking %>%
inspect_na()
#> # A tibble: 28 x 3
#> col_name cnt pcnt
#> <chr> <int> <dbl>
#> 1 children 4 0.00335
#> 2 hotel 0 0
#> 3 is_canceled 0 0
#> 4 lead_time 0 0
#> 5 arrival_date_year 0 0
#> 6 arrival_date_month 0 0
#> 7 arrival_date_week_number 0 0
#> 8 arrival_date_day_of_month 0 0
#> 9 stays_in_weekend_nights 0 0
#> 10 stays_in_week_nights 0 0
#> # ... with 18 more rows
From the result above, the children variable has 4 missing values; let's fill them with 0.
booking <- booking %>%
mutate(children = replace_na(children,0))
Now let's check the condition of the categorical variables using the inspect_cat() function.
booking %>%
inspect_cat()
#> # A tibble: 11 x 5
#> col_name cnt common common_pcnt levels
#> <chr> <int> <chr> <dbl> <named list>
#> 1 arrival_date_month 12 August 11.6 <tibble [12 x 3]>
#> 2 assigned_room_type 12 A 62.0 <tibble [12 x 3]>
#> 3 country 178 PRT 40.7 <tibble [178 x 3]>
#> 4 customer_type 4 Transient 75.1 <tibble [4 x 3]>
#> 5 deposit_type 3 No Deposit 87.6 <tibble [3 x 3]>
#> 6 distribution_channel 5 TA/TO 82.0 <tibble [5 x 3]>
#> 7 hotel 2 City Hotel 66.4 <tibble [2 x 3]>
#> 8 is_canceled 2 0 63.0 <tibble [2 x 3]>
#> 9 market_segment 8 Online TA 47.3 <tibble [8 x 3]>
#> 10 meal 5 BB 77.3 <tibble [5 x 3]>
#> 11 reserved_room_type 10 A 72.0 <tibble [10 x 3]>
From the result above, the country column has 178 unique values. We will reduce the number of unique values in country to 11 by keeping the 10 countries that appear most frequently and grouping the remaining countries into "Other".
booking <- booking %>%
mutate(country = fct_lump_n(country, n = 10))
booking %>%
inspect_cat() %>%
show_plot()
Before we do the modeling, let's first check the proportions of the target classes to find out how balanced they are.
booking %>%
pull(is_canceled) %>%
table() %>%
prop.table()
#> .
#> 0 1
#> 0.6295837 0.3704163
The class labeled 0 makes up about 63% of the data, while the class labeled 1 makes up about 37%, which shows that class 0 is dominant.
Modelling
We'll create our training and testing data using the initial_split() function.
set.seed(100)
splitted <- initial_split(booking, prop = 0.8,strata = is_canceled)
data_train <- training(splitted)
data_test <- testing(splitted)
The function used to create the AdaBoost model is adaboost() from the fastAdaboost package. There are 3 parameters that can be filled in this function:
– formula : Formula for the model
– data : Data used in the modeling process
– nIter : Number of stumps used in the model
model_ada <- adaboost(formula = is_canceled~.,
data = data_train,
nIter = 100)
As we know, each stump in the model has a different weight; the weight of each stump can be seen in model_ada$weights. When the weights are visualized, we can see that the stumps formed at the end of the iterations have smaller weights than the stumps formed at the beginning.
plot_weights <- data.frame(stump_id = c(1:100),
weight = model_ada$weights) %>%
ggplot(aes(y = weight, x = stump_id)) +
geom_col(fill = "dodgerblue3")
plot_weights
Now let’s predict the test dataset
pred_hotel <- predict(object = model_ada, newdata = data_test)
str(pred_hotel)
#> List of 5
#> $ formula:Class 'formula' language is_canceled ~ .
#> .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
#> $ votes : num [1:23877, 1:2] 22.2 23.6 14.2 22.5 17.4 ...
#> $ class : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 2 2 ...
#> $ prob : num [1:23877, 1:2] 0.9 0.956 0.576 0.912 0.704 ...
#> $ error : num 0.113
The prediction object has several components:
– $votes : Total weighted votes achieved by each class
– $class : The class predicted by the classifier
– $prob : A matrix with the predicted probability of each class for each observation
– $error : The error on the test data, if labeled (1 - accuracy)
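For instance, we can take a quick look at these components directly (a short illustrative check using the pred_hotel object created above):
head(pred_hotel$prob[, 2])     # predicted probability of class "1" for the first observations
table(pred_hotel$class)        # predicted class for every observation in data_test
1 - mean(pred_hotel$class == data_test$is_canceled)   # should match pred_hotel$error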
Now let's check how good our model is using a confusion matrix.
confusionMatrix(data = pred_hotel$class, reference = data_test$is_canceled, positive = "1")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 13725 1400
#> 1 1308 7444
#>
#> Accuracy : 0.8866
#> 95% CI : (0.8825, 0.8906)
#> No Information Rate : 0.6296
#> P-Value [Acc > NIR] : < 0.0000000000000002
#>
#> Kappa : 0.7563
#>
#> Mcnemar's Test P-Value : 0.08034
#>
#> Sensitivity : 0.8417
#> Specificity : 0.9130
#> Pos Pred Value : 0.8505
#> Neg Pred Value : 0.9074
#> Prevalence : 0.3704
#> Detection Rate : 0.3118
#> Detection Prevalence : 0.3665
#> Balanced Accuracy : 0.8773
#>
#> 'Positive' Class : 1
#>
Based on the confusion matrix above, the accuracy of the model is 0.88. Since our data is dominated by the class labeled 0 (63%), we should use another metric to find out how well our model predicts both classes. We're going to use the AUC.
pred_df <- pred_hotel$prob %>%
as.data.frame() %>%
rename(class0 = V1,
class1 = V2) %>%
mutate(predicted = pred_hotel$class,
actual = data_test$is_canceled)
auc_ada <- roc_auc(data = pred_df, truth = actual,class1)
auc_ada
#> # A tibble: 1 x 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 roc_auc binary 0.955
The AUC results show that the model is good at predicting the target class, indicated by an AUC value of 0.95 (the closer to 1, the better).
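Besides the single AUC number, we can also plot the ROC curve itself with roc_curve() from yardstick (loaded through tidymodels), reusing the pred_df data frame built above:
pred_df %>%
  roc_curve(truth = actual, class1) %>%
  autoplot()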
AdaBoost has a lot of advantages; mainly, it is easy to use and needs less parameter tweaking than algorithms like XGBoost. AdaBoost can also reduce variance on the test data.
XGBoost
XGBoost was formulated by Tianqi Chen and started as a research project within the Distributed (Deep) Machine Learning Community (DMLC) group. XGBoost is popular in part because it has been the winning algorithm in a number of recent Kaggle competitions. XGBoost is a specific implementation of gradient boosting that uses more accurate approximations to find the best tree model[^2]. In particular, it uses a more regularized model formalization to control overfitting, which gives it better performance.
How Does XGBoost Work?
System Optimization: [^5]
- Parallelized tree building
XGBoost approaches the process of sequential tree building using a parallelized implementation.
- Tree pruning
Unlike GBM, where tree splitting stops once a negative loss is encountered, XGBoost grows the tree up to max_depth and then prunes backward until the improvement in the loss function is below a threshold.
- Cache awareness and out of core computing
XGBoost has been designed to reduce computing time efficiently and make optimal use of memory resources. This is accomplished through cache awareness, with internal buffers allocated in each thread to store gradient statistics. Further enhancements such as 'out-of-core' computing optimize available disk space when handling big data frames that do not fit into memory.
- Regularization
The biggest advantage of XGBoost is regularization. Regularization is a technique used to avoid overfitting in linear and tree-based models by limiting, regulating, or shrinking the estimated coefficients towards zero.
- Handles missing values
XGBoost handles missing values by learning the best default direction for them at each split: a sparsity-aware split finding algorithm deals with the different sparsity patterns in the data, so missing values do not have to be imputed beforehand (see the short sketch after this list).
- Built-in cross validation
The algorithm comes with a built-in cross-validation method at each iteration, removing the need to explicitly program this search or to specify the exact number of boosting iterations required in a single run.
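As a quick illustration of the missing-value handling mentioned above (a minimal sketch on hypothetical toy data, not the hotel dataset): xgboost() accepts NA entries in a dense matrix, with missing = NA by default, and learns a default branch direction for them, so no imputation is required.
set.seed(42)
x <- matrix(rnorm(200), ncol = 2)
x[sample(length(x), 20)] <- NA                     # inject some missing values
y <- as.numeric(rowSums(x, na.rm = TRUE) > 0)      # toy binary label
toy_xgb <- xgboost(data = x, label = y, nrounds = 5,
                   objective = "binary:logistic", verbose = 0)
head(predict(toy_xgb, x))                          # predictions work with NA present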
Regularization and training loss
XGBoost adds a regularization term that controls the complexity of the model, which helps us avoid overfitting. The objective function measures how well the model fits the training data and consists of two parts: the training loss and the regularization term:
\(obj(\theta )= L(\theta )+\Omega (\theta )\)
where \(L\) is the training loss function and \(\Omega\) is the regularization term. The training loss measures how well the model fits the training data, while \(\Omega\) penalizes the complexity of the tree functions.[^3]
For the regression case, the training loss function is the squared error:
\(L(\theta) = \sum_{i=1}^{n}(y_i-\hat{y}_i)^2\)
For the classification case, the training loss function is the logistic loss:
\(L(\theta) = \sum_{i=1}^{n}\left[y_i\ln(1+e^{-\hat{y}_i})+(1-y_i)\ln(1+e^{\hat{y}_i})\right]\)
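To make these formulas concrete, here is a small sketch of both losses as plain R functions (hypothetical helpers; note that \(\hat{y}_i\) in the classification loss is the raw prediction on the log-odds scale):
loss_squared <- function(y, y_hat) {
  # squared error training loss for regression
  sum((y - y_hat)^2)
}
loss_logistic <- function(y, y_hat) {
  # logistic training loss for binary classification (y in {0, 1})
  sum(y * log(1 + exp(-y_hat)) + (1 - y) * log(1 + exp(y_hat)))
}
loss_squared(c(3, 5), c(2.5, 4))     # small worked example: 0.25 + 1 = 1.25
loss_logistic(c(1, 0), c(2, -1))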
Case Example using XGBoost
Modelling
set.seed(100)
splitted <- initial_split(booking, prop = 0.8,strata = is_canceled)
data_train <- training(splitted)
data_test <- testing(splitted)
Split the target variable into label_train and label_test.
label_train <- as.numeric(as.character(data_train$is_canceled))
label_test <- as.numeric(as.character(data_test$is_canceled))
The most important thing when working with XGBoost is converting the data to a DMatrix, because XGBoost requires a matrix input for the features.
# convert data to matrix
train_matrix <- data.matrix(data_train[,-2])
test_matrix <- data.matrix(data_test[,-2])
# convert data to Dmatrix
dtrain <- xgb.DMatrix(data = train_matrix, label = label_train)
dtest <- xgb.DMatrix(data = test_matrix, label = label_test)
Tuning Parameters
There is no benchmark that defines the ideal parameters, because they depend on your data and the specific problem. XGBoost parameters can be divided into three categories:[^6]
General Parameters
Controls the booster type in the model which eventually drives overall functioning
- booster
For classification problems, we can use the gbtree booster. With gbtree, trees are grown one after another, each attempting to reduce the misclassification rate of the previous iterations: the next tree is built by giving a higher weight to the points misclassified by the previous tree.
For regression problems, we can use gbtree or gblinear. With gblinear, a generalized linear model is built and optimized using regularization and gradient descent; the next model is built on the residuals generated by the previous iterations.
- nthread
Enables parallel computing. The default is the maximum number of cores available.
- verbosity
Verbosity of printed messages. The default value is 1 (warning); 0 is silent, 2 is info, and 3 is debug.
Booster Parameters:
Controls the performance of the selected booster
- eta
The range of eta is 0 to 1 and the default value is 0.3. Eta is the learning rate: it shrinks the contribution of each new tree, so a lower eta makes the boosting more conservative but requires more iterations and slower computation.
- gamma
The range of gamma is 0 to infinite and the default value is 0 (no regularization). Gamma is the minimum loss reduction required to make a further split; the higher the gamma, the stronger the regularization, because splits that do not improve the model enough are not made.
- nrounds
Controls the maximum number of boosting iterations.
- nfold
The observations are randomly partitioned into nfold subsamples of equal size.
- max_depth
Maximum depth of a tree. The range of max_depth is 0 to infinite and the default value is 6; increasing this value makes the model more complex and more likely to overfit.
- min_child_weight
The range of min_child_weight is 0 to infinite and the default value is 1. If a leaf node has a sum of instance weights lower than min_child_weight during the tree partitioning step, the tree stops splitting further.
- subsample
The range of subsample is 0 to 1 and the default value is 1. It controls the fraction of observations supplied to each tree: a value of 0.5 means XGBoost randomly samples half of the training data prior to growing each tree, which helps prevent overfitting. Subsampling occurs once in every boosting iteration.
- colsample_bytree
The range of colsample_bytree is 0 to 1 and the default value is 1. It controls the subsample ratio of columns when constructing each tree.
Learning Task Parameters
Sets and evaluates the learning process of booster from the given data.
- objective
  - reg:squarederror : for regression with squared loss
  - binary:logistic : for binary classification
- eval_metric
Evaluation metric for validation data. The default is RMSE for the regression case and classification error for the classification case.
Next, we define the parameters that will be used:
params <- list(booster = "gbtree",
objective = "binary:logistic",
eta=0.1,
gamma=10,
max_depth=10,
min_child_weight=1,
subsample=1,
colsample_bytree=1)
One of the simplest ways to see the training progress is to set the verbose option to TRUE.
tic()
xgbcv <- xgb.cv( params = params,
data = dtrain,
nrounds = 1000,
showsd = T,
nfold = 5,
stratified = T,
print_every_n = 50,
early_stopping_rounds = 20,
maximize = F)
#> [1] train-error:0.159371+0.000575 test-error:0.162041+0.002767
#> Multiple eval metrics are present. Will use test_error for early stopping.
#> Will train until test_error hasn't improved in 20 rounds.
#>
#> [51] train-error:0.134139+0.000721 test-error:0.140127+0.002438
#> [101] train-error:0.122489+0.000533 test-error:0.132233+0.001973
#> Stopping. Best iteration:
#> [114] train-error:0.121790+0.000932 test-error:0.131793+0.002073
print(xgbcv)
#> ##### xgb.cv 5-folds
#> iter train_error_mean train_error_std test_error_mean test_error_std
#> 1 0.1593710 0.0005746874 0.1620406 0.002766615
#> 2 0.1589574 0.0008488257 0.1616114 0.002560948
#> 3 0.1582874 0.0008635417 0.1612242 0.002240224
#> 4 0.1574078 0.0012001357 0.1600936 0.002275888
#> 5 0.1563270 0.0017212650 0.1593920 0.001805679
#> ---
#> 130 0.1217530 0.0009838978 0.1318562 0.002019269
#> 131 0.1217504 0.0009883673 0.1318562 0.002019269
#> 132 0.1217504 0.0009883673 0.1318562 0.002019269
#> 133 0.1217504 0.0009883673 0.1318562 0.002019269
#> 134 0.1217504 0.0009883673 0.1318562 0.002019269
#> Best iteration:
#> iter train_error_mean train_error_std test_error_mean test_error_std
#> 114 0.1217898 0.0009316736 0.1317934 0.002072701
toc()
#> 72.96 sec elapsed
tic()
xgb1 <- xgb.train (params = params,
data = dtrain,
nrounds = xgbcv$best_iteration,
watchlist = list(val=dtest,train=dtrain),
print_every_n = 100,
early_stoping_rounds = 10,
maximize = F ,
eval_metric = "error",
verbosity = 0)
#> [1] val-error:0.157641 train-error:0.158900
#> [101] val-error:0.125644 train-error:0.121533
#> [114] val-error:0.124011 train-error:0.119858
toc()
#> 15.86 sec elapsed
xgbpred_prob <-predict(object = xgb1, newdata = dtest)
xgbpred <- ifelse (xgbpred_prob > 0.5,1,0)
In this section, we evaluate the performance of the XGBoost model.
confusionMatrix(as.factor(xgbpred), as.factor(label_test))
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 13902 1830
#> 1 1131 7014
#>
#> Accuracy : 0.876
#> 95% CI : (0.8717, 0.8801)
#> No Information Rate : 0.6296
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 0.7297
#>
#> Mcnemar's Test P-Value : < 0.00000000000000022
#>
#> Sensitivity : 0.9248
#> Specificity : 0.7931
#> Pos Pred Value : 0.8837
#> Neg Pred Value : 0.8611
#> Prevalence : 0.6296
#> Detection Rate : 0.5822
#> Detection Prevalence : 0.6589
#> Balanced Accuracy : 0.8589
#>
#> 'Positive' Class : 0
#>
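Note that confusionMatrix() here takes the first factor level ("0") as the positive class, whereas the AdaBoost evaluation explicitly set positive = "1", so the sensitivity and specificity above are not directly comparable between the two models. To put them on the same footing, we could re-run the evaluation with the positive class set explicitly (a small sketch; output not shown):
confusionMatrix(data = as.factor(xgbpred),
                reference = as.factor(label_test),
                positive = "1")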
Let's check the variable importance from the model:
var_imp <- xgb.importance(model = xgb1,
feature_names = dimnames(dtrain)[[2]])
var_imp %>%
mutate_if(is.numeric, round, digits = 2)
#> Feature Gain Cover Frequency
#> 1: deposit_type 0.35 0.07 0.01
#> 2: country 0.13 0.12 0.10
#> 3: lead_time 0.09 0.11 0.13
#> 4: market_segment 0.07 0.06 0.04
#> 5: total_of_special_requests 0.06 0.06 0.03
#> 6: required_car_parking_spaces 0.05 0.08 0.02
#> 7: previous_cancellations 0.04 0.05 0.03
#> 8: arrival_date_year 0.04 0.04 0.06
#> 9: adr 0.03 0.10 0.14
#> 10: customer_type 0.02 0.03 0.04
#> 11: arrival_date_week_number 0.02 0.04 0.09
#> 12: reserved_room_type 0.02 0.03 0.03
#> 13: booking_changes 0.01 0.03 0.03
#> 14: assigned_room_type 0.01 0.05 0.03
#> 15: previous_bookings_not_canceled 0.01 0.02 0.02
#> 16: hotel 0.01 0.02 0.02
#> 17: stays_in_week_nights 0.01 0.02 0.03
#> 18: arrival_date_month 0.01 0.01 0.02
#> 19: arrival_date_day_of_month 0.01 0.01 0.03
#> 20: stays_in_weekend_nights 0.01 0.01 0.02
#> 21: meal 0.01 0.02 0.02
#> 22: distribution_channel 0.00 0.01 0.01
#> 23: adults 0.00 0.01 0.02
#> 24: children 0.00 0.01 0.01
#> 25: is_repeated_guest 0.00 0.01 0.01
#> 26: days_in_waiting_list 0.00 0.01 0.01
#> 27: babies 0.00 0.00 0.00
#> Feature Gain Cover Frequency
The xgb.importance() function displays importance values calculated with different metrics:
- The gain value is the percentage contribution of the feature across all the trees in the model.
- The cover value is the percentage of observations related to each feature across all trees. For example, if we have 100 observations, 3 trees, and the trees have 5, 8, and 10 observations in nodes that split on feature "A", the cover sums 5 + 8 + 10 = 23 observations; feature "A" then has a cover value of 0.23.
- The frequency value is the percentage of times a feature is used for a split in the trees of the model. For example, if feature "A" occurred in 3 splits, 2 splits, and 2 splits in the three trees, its frequency is 3 + 2 + 2 = 7 splits divided by the total number of splits for all features (see the short arithmetic sketch below).
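Plugging in the toy numbers from the two examples above (purely hypothetical counts, with the total number of splits in the model assumed here to be 50):
cover_A <- (5 + 8 + 10) / 100      # 23 of 100 observations -> cover value 0.23
freq_A  <- (3 + 2 + 2) / 50        # 7 of 50 total splits   -> frequency value 0.14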
xgb.ggplot.importance(var_imp,top_n = 10) + theme_minimal()
The graph shows the variable importance based on the gain value by default, and it also displays clusters of features that have similar importance values. The 10 features shown above are the ones with the most significant impact on the prediction results.
Next, we evaluate the model's performance using the ROC curve (AUC) and compare it with the AdaBoost model.
xgb_result <- data.frame(class1 = xgbpred_prob, actual = as.factor(label_test))
auc_xgb <- roc_auc(data = xgb_result, truth = actual,class1)
result <- rbind(auc_ada, auc_xgb) %>%
mutate(model = c("AdaBoost", "XGBoost")) %>%
select(model, everything())
result
#> # A tibble: 2 x 4
#> model .metric .estimator .estimate
#> <chr> <chr> <chr> <dbl>
#> 1 AdaBoost roc_auc binary 0.955
#> 2 XGBoost roc_auc binary 0.948
The AUC results show that the AdaBoost and XGBoost models have similar values, roughly 0.95 each. However, training the AdaBoost model took about 60 minutes, while the XGBoost model needed only around 60 seconds. In terms of speed, XGBoost clearly works better than AdaBoost.
Conclusion
In this article, we described how the AdaBoost and XGBoost algorithms work and how to build the models. We can conclude several points:
- Both algorithms are built on converting weak learners into a strong learner
- AdaBoost has only a few hyperparameters to tune, and the model is easy to understand and to visualize
- The choice of algorithm depends on the data set: for low-noise data, when timeliness of the result is not the main concern, we can use the AdaBoost model
- For complex, high-dimensional data, XGBoost performs better than AdaBoost because of its system optimizations
Reference
[^1] : XGBoost, a Top Machine Learning Method on Kaggle
[^2] : XGBoost: The Excalibur for Everyone
[^3] : Introduction to Boosted Trees
[^4] : Machine Learning Basics-Gradient Boosting & XGBoost
[^5] : XGBoost Algorithm
[^6] : Parameter Tuning in R