1. Setup
Libraries and Setup
We’ll set-up caching for this notebook given how computationally expensive some of the code we will write can get.
You will need to use install.packages()
to install any packages that are not already downloaded onto your machine. You then load the package into your workspace using the library()
function:
library(tidyverse)
library(caret)
2. Nested Dataframe
You’ll learn how to use purrr
, caret
and dplyr
to quickly create some of dataset + model combinations, store data & model objects neatly in one tibble, and post process programatically. These tools enable succinct functional programming in which a lot gets done with just a few lines of code. The data to be used is loan.csv which can be downloaded here link here. In this article we will predict the default
variable which has a yes or no value.
loan <- read.csv("data_input/loan.csv")
glimpse(loan)
#> Observations: 1,000
#> Variables: 17
#> $ checking_balance <fct> < 0 DM, 1 - 200 DM, unknown, < 0 DM, < 0 DM, u...
#> $ months_loan_duration <int> 6, 48, 12, 42, 24, 36, 24, 36, 12, 30, 12, 48,...
#> $ credit_history <fct> critical, good, critical, good, poor, good, go...
#> $ purpose <fct> furniture/appliances, furniture/appliances, ed...
#> $ amount <int> 1169, 5951, 2096, 7882, 4870, 9055, 2835, 6948...
#> $ savings_balance <fct> unknown, < 100 DM, < 100 DM, < 100 DM, < 100 D...
#> $ employment_duration <fct> > 7 years, 1 - 4 years, 4 - 7 years, 4 - 7 yea...
#> $ percent_of_income <int> 4, 2, 2, 2, 3, 2, 3, 2, 2, 4, 3, 3, 1, 4, 2, 4...
#> $ years_at_residence <int> 4, 2, 3, 4, 4, 4, 4, 2, 4, 2, 1, 4, 1, 4, 4, 2...
#> $ age <int> 67, 22, 49, 45, 53, 35, 53, 35, 61, 28, 25, 24...
#> $ other_credit <fct> none, none, none, none, none, none, none, none...
#> $ housing <fct> own, own, own, other, other, other, own, rent,...
#> $ existing_loans_count <int> 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1...
#> $ job <fct> skilled, skilled, unskilled, skilled, skilled,...
#> $ dependents <int> 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
#> $ phone <fct> yes, no, no, no, no, yes, no, yes, no, no, no,...
#> $ default <fct> no, yes, no, no, yes, no, no, no, no, yes, yes...
loan <- loan %>%
head(-10)
test <- loan %>%
tail(10) %>%
select(-default)
The loan data will be divided into test data and loan data. Test data will be used when testing the model that has been made, while the data loan will be used to model the classification.
2.1 Single Data Frame x Multiple Model
Before creating a nested dataframe, we must prepare a model that will be used first. The model to be used must be used as a function to make it easier when used in the map ()
function that comes from the purrr
package. besides that we can set the parameters that will be used in the function. In the chunk below 2 models are created, namely the decision tree and random forest using the caret
package.
#create Random forest function
RandomForestModel <- function(X, Y){
ctrl <- trainControl(
method = "cv",
number = 3
)
train(
x = X,
y = Y,
trContrl = ctrl,
method = 'rf'
)
}
#create decision tree function
RpartModel <- function(X, Y) {
ctrl <- trainControl(
method = "repeatedcv",
number = 5
)
train(
x = X,
y = Y,
method = 'rpart2',
trControl = ctrl,
tuneGrid = data.frame(maxdepth=c(2,3,4,5)),
preProc = c('center', 'scale')
)
}
After making a model in the form of a function, then making the model into a dataframe.
model_list <- list(rpart = RpartModel,
rforest = RandomForestModel) %>%
enframe(name = 'modelName',value = 'model')
model_list
#> # A tibble: 2 x 2
#> modelName model
#> <chr> <list>
#> 1 rpart <fn>
#> 2 rforest <fn>
model_list
produces 2 columns, namely modelName
, and model
. ModelName
is the name of the model, and the model
contains the functions of the model.
Next the dataframe to be used replicates as many models as you want to use. The loan dataset will be replicates as much as the model used by rep ()
function.
nmodel <- length(model_list) #get length of model_list
nested.loan <- list(loan) %>%
rep(nmodel) %>%
enframe(name = "Id", value = "rawdata")
nested.loan
#> # A tibble: 2 x 2
#> Id rawdata
#> <int> <list>
#> 1 1 <df[,17] [990 x 17]>
#> 2 2 <df[,17] [990 x 17]>
nested.loan
has 2 columns, namely Id
and rawdata
which contain the loan dataframe. Then rawdata will be separated into train.y which contains the default
variable and train.x contains the others.
nested.loan <- nested.loan %>%
mutate(train.x = map(rawdata, ~select(.x, -default)),
train.y = map(rawdata, ~.x$default))
nested.loan
#> # A tibble: 2 x 4
#> Id rawdata train.x train.y
#> <int> <list> <list> <list>
#> 1 1 <df[,17] [990 x 17]> <df[,16] [990 x 16]> <fct [990]>
#> 2 2 <df[,17] [990 x 17]> <df[,16] [990 x 16]> <fct [990]>
The next step is to join nested.loan
withmodel_list
using bind_cols ()
nested.loan <- nested.loan %>%
bind_cols(model_list)
nested.loan
#> # A tibble: 2 x 6
#> Id rawdata train.x train.y modelName model
#> <int> <list> <list> <list> <chr> <list>
#> 1 1 <df[,17] [990 x 17]> <df[,16] [990 x 16]> <fct [990]> rpart <fn>
#> 2 2 <df[,17] [990 x 17]> <df[,16] [990 x 16]> <fct [990]> rforest <fn>
The model we have created can be used with the function invoke_map ()
which functions to combine functions and lists as parameters.
nested.loan <- nested.loan %>%
mutate(parm = map2(train.x, train.y, ~list(.x, .y)),
model = invoke_map(model,parm))
nested.loan
#> # A tibble: 2 x 7
#> Id rawdata train.x train.y modelName model parm
#> <int> <list> <list> <list> <chr> <list> <list>
#> 1 1 <df[,17] [990 x ~ <df[,16] [990 x ~ <fct [990~ rpart <trai~ <list [~
#> 2 2 <df[,17] [990 x ~ <df[,16] [990 x ~ <fct [990~ rforest <trai~ <list [~
To see how well the model has been made, it can be seen from the Accuracy of each model.
nested.loan <- nested.loan %>%
mutate(Accuracy = map_dbl(model, ~max(.x$results$Accuracy))) %>%
arrange(desc(Accuracy))
nested.loan
#> # A tibble: 2 x 8
#> Id rawdata train.x train.y modelName model parm Accuracy
#> <int> <list> <list> <list> <chr> <lis> <list> <dbl>
#> 1 2 <df[,17] [990~ <df[,16] [990~ <fct [99~ rforest <tra~ <list ~ 0.752
#> 2 1 <df[,17] [990~ <df[,16] [990~ <fct [99~ rpart <tra~ <list ~ 0.728
From the above results it can be seen that the random forest model produces an accuracy of 0.75 and the decission tree is 0.74. Next, we will do predict to the test data that has been made using an existing model. the data test must replicate as many models as used and then join the nested.loan
data using left_join()
nested.loan <- test %>%
list() %>%
rep(nrow(nested.loan)) %>%
enframe(name = "Id",value = "test.x") %>%
left_join(nested.loan, by = "Id")
nested.loan
#> # A tibble: 2 x 9
#> Id test.x rawdata train.x train.y modelName model parm Accuracy
#> <int> <list> <list> <list> <list> <chr> <lis> <list> <dbl>
#> 1 1 <df[,16]~ <df[,17] ~ <df[,16] ~ <fct [9~ rpart <tra~ <list~ 0.728
#> 2 2 <df[,16]~ <df[,17] ~ <df[,16] ~ <fct [9~ rforest <tra~ <list~ 0.752
Now we create a pred
variable that contains results from predict
nested.loan <- nested.loan %>%
mutate(pred = map2(model, test.x, ~predict(.x, .y))) %>%
select(Id,modelName,Accuracy,pred)
glimpse(nested.loan$pred)
#> List of 2
#> $ : Factor w/ 2 levels "no","yes": 2 1 1 2 1 1 1 1 2 2
#> $ : Factor w/ 2 levels "no","yes": 2 2 1 2 1 1 1 1 1 1
2.2 Multiple Data Frame x Single Model
Now we will split loan data by checking_balance
variable which has 4 levels namely < 0 DM
, > 200 DM
, 1 - 200 DM
, and unknown
nested.split <- loan %>%
group_by(checking_balance) %>%
nest(.key = "rawdata") %>%
mutate(train.x = map(rawdata, ~select(.x, -default)),
train.y = map(rawdata, ~.x$default))
nested.split
#> # A tibble: 4 x 4
#> # Groups: checking_balance [4]
#> checking_balance rawdata train.x train.y
#> <fct> <list> <list> <list>
#> 1 < 0 DM <tibble [270 x 16]> <tibble [270 x 15]> <fct [270]>
#> 2 1 - 200 DM <tibble [268 x 16]> <tibble [268 x 15]> <fct [268]>
#> 3 unknown <tibble [389 x 16]> <tibble [389 x 15]> <fct [389]>
#> 4 > 200 DM <tibble [63 x 16]> <tibble [63 x 15]> <fct [63]>
The model that will be used is random forest
nested.rf <- nested.split %>%
mutate(param = map2(train.x, train.y, ~list(.x, .y)),
model = invoke_map(RandomForestModel,param))
nested.rf
#> # A tibble: 4 x 6
#> # Groups: checking_balance [4]
#> checking_balance rawdata train.x train.y param model
#> <fct> <list> <list> <list> <list> <list>
#> 1 < 0 DM <tibble [270 x ~ <tibble [270 x ~ <fct [270~ <list [2~ <trai~
#> 2 1 - 200 DM <tibble [268 x ~ <tibble [268 x ~ <fct [268~ <list [2~ <trai~
#> 3 unknown <tibble [389 x ~ <tibble [389 x ~ <fct [389~ <list [2~ <trai~
#> 4 > 200 DM <tibble [63 x 1~ <tibble [63 x 1~ <fct [63]> <list [2~ <trai~
To see how well the model is made, we can see the accuracy obtained from the model we made.
nested.rf <- nested.rf %>%
mutate(Accuracy = map_dbl(model, ~max(.x$results$Accuracy)))
nested.rf
#> # A tibble: 4 x 7
#> # Groups: checking_balance [4]
#> checking_balance rawdata train.x train.y param model Accuracy
#> <fct> <list> <list> <list> <list> <lis> <dbl>
#> 1 < 0 DM <tibble [270~ <tibble [270~ <fct [27~ <list [~ <tra~ 0.641
#> 2 1 - 200 DM <tibble [268~ <tibble [268~ <fct [26~ <list [~ <tra~ 0.667
#> 3 unknown <tibble [389~ <tibble [389~ <fct [38~ <list [~ <tra~ 0.885
#> 4 > 200 DM <tibble [63 ~ <tibble [63 ~ <fct [63~ <list [~ <tra~ 0.761
2.3 Multiple Data Frame x Multiple Model
To run multi models against multi data, we must repeat data as much as the model that will be used. nested.split
is a data loan that is divided based on thechecking_balance
variable which contains 4 levels while the model used is 2, namely random forest and decision tree, the amount of data is 8 (4 X 2).
#replicates nested.split as much the model will be used
nested.multi <- nested.split %>%
list() %>%
rep(nmodel) %>%
bind_rows()
#replicates model_list as much the data frame
model.multi <- model_list %>%
list() %>%
rep(nrow(nested.split)) %>%
bind_rows() %>%
arrange(modelName)
#join nested.multi and model.multi
nested.multi <- nested.multi %>%
bind_cols(model.multi)
nested.multi
#> # A tibble: 8 x 6
#> # Groups: checking_balance [4]
#> checking_balance rawdata train.x train.y modelName model
#> <fct> <list> <list> <list> <chr> <lis>
#> 1 < 0 DM <tibble [270 x 1~ <tibble [270 x ~ <fct [270~ rforest <fn>
#> 2 1 - 200 DM <tibble [268 x 1~ <tibble [268 x ~ <fct [268~ rforest <fn>
#> 3 unknown <tibble [389 x 1~ <tibble [389 x ~ <fct [389~ rforest <fn>
#> 4 > 200 DM <tibble [63 x 16~ <tibble [63 x 1~ <fct [63]> rforest <fn>
#> 5 < 0 DM <tibble [270 x 1~ <tibble [270 x ~ <fct [270~ rpart <fn>
#> 6 1 - 200 DM <tibble [268 x 1~ <tibble [268 x ~ <fct [268~ rpart <fn>
#> 7 unknown <tibble [389 x 1~ <tibble [389 x ~ <fct [389~ rpart <fn>
#> 8 > 200 DM <tibble [63 x 16~ <tibble [63 x 1~ <fct [63]> rpart <fn>
Now we can modeling each data category with each model
nested.multi <- nested.multi %>%
mutate(param = map2(train.x, train.y, ~list(.x, .y)),
model = invoke_map(model,param))
nested.multi
#> # A tibble: 8 x 7
#> # Groups: checking_balance [4]
#> checking_balance rawdata train.x train.y modelName model param
#> <fct> <list> <list> <list> <chr> <lis> <list>
#> 1 < 0 DM <tibble [270~ <tibble [270~ <fct [27~ rforest <tra~ <list ~
#> 2 1 - 200 DM <tibble [268~ <tibble [268~ <fct [26~ rforest <tra~ <list ~
#> 3 unknown <tibble [389~ <tibble [389~ <fct [38~ rforest <tra~ <list ~
#> 4 > 200 DM <tibble [63 ~ <tibble [63 ~ <fct [63~ rforest <tra~ <list ~
#> 5 < 0 DM <tibble [270~ <tibble [270~ <fct [27~ rpart <tra~ <list ~
#> 6 1 - 200 DM <tibble [268~ <tibble [268~ <fct [26~ rpart <tra~ <list ~
#> 7 unknown <tibble [389~ <tibble [389~ <fct [38~ rpart <tra~ <list ~
#> 8 > 200 DM <tibble [63 ~ <tibble [63 ~ <fct [63~ rpart <tra~ <list ~
nested.multi$model[[1]]$finalModel
#>
#> Call:
#> randomForest(x = x, y = y, mtry = param$mtry, trContrl = ..1)
#> Type of random forest: classification
#> Number of trees: 500
#> No. of variables tried at each split: 8
#>
#> OOB estimate of error rate: 34.44%
#> Confusion matrix:
#> no yes class.error
#> no 90 46 0.3382353
#> yes 47 87 0.3507463
And we can see the accuracy each model
nested.multi <- nested.multi %>%
mutate(Accuracy = map_dbl(model, ~max(.x$results$Accuracy)))
library(lime)
model_result <- lime(x = nested.multi$train.x[[1]],nested.multi$model[[1]])
model_result$bin_continuous
#> [1] TRUE