Nested Dataframe

1. Setup

Libraries and Setup

We’ll set-up caching for this notebook given how computationally expensive some of the code we will write can get.

You will need to use install.packages() to install any packages that are not already downloaded onto your machine. You then load the package into your workspace using the library() function:

library(tidyverse)
library(caret)

2. Nested Dataframe

You’ll learn how to use purrr, caret and dplyr to quickly create some of dataset + model combinations, store data & model objects neatly in one tibble, and post process programatically. These tools enable succinct functional programming in which a lot gets done with just a few lines of code. The data to be used is loan.csv which can be downloaded here link here. In this article we will predict the default variable which has a yes or no value.

loan <- read.csv("data_input/loan.csv")
glimpse(loan)
#> Observations: 1,000
#> Variables: 17
#> $ checking_balance     <fct> < 0 DM, 1 - 200 DM, unknown, < 0 DM, < 0 DM, u...
#> $ months_loan_duration <int> 6, 48, 12, 42, 24, 36, 24, 36, 12, 30, 12, 48,...
#> $ credit_history       <fct> critical, good, critical, good, poor, good, go...
#> $ purpose              <fct> furniture/appliances, furniture/appliances, ed...
#> $ amount               <int> 1169, 5951, 2096, 7882, 4870, 9055, 2835, 6948...
#> $ savings_balance      <fct> unknown, < 100 DM, < 100 DM, < 100 DM, < 100 D...
#> $ employment_duration  <fct> > 7 years, 1 - 4 years, 4 - 7 years, 4 - 7 yea...
#> $ percent_of_income    <int> 4, 2, 2, 2, 3, 2, 3, 2, 2, 4, 3, 3, 1, 4, 2, 4...
#> $ years_at_residence   <int> 4, 2, 3, 4, 4, 4, 4, 2, 4, 2, 1, 4, 1, 4, 4, 2...
#> $ age                  <int> 67, 22, 49, 45, 53, 35, 53, 35, 61, 28, 25, 24...
#> $ other_credit         <fct> none, none, none, none, none, none, none, none...
#> $ housing              <fct> own, own, own, other, other, other, own, rent,...
#> $ existing_loans_count <int> 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1...
#> $ job                  <fct> skilled, skilled, unskilled, skilled, skilled,...
#> $ dependents           <int> 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
#> $ phone                <fct> yes, no, no, no, no, yes, no, yes, no, no, no,...
#> $ default              <fct> no, yes, no, no, yes, no, no, no, no, yes, yes...
loan <- loan %>%
  head(-10)

test <- loan %>% 
  tail(10) %>% 
  select(-default)

The loan data will be divided into test data and loan data. Test data will be used when testing the model that has been made, while the data loan will be used to model the classification.

2.1 Single Data Frame x Multiple Model

Before creating a nested dataframe, we must prepare a model that will be used first. The model to be used must be used as a function to make it easier when used in the map () function that comes from the purrr package. besides that we can set the parameters that will be used in the function. In the chunk below 2 models are created, namely the decision tree and random forest using the caret package.

#create Random forest function
RandomForestModel <- function(X, Y){
    ctrl <- trainControl(
        method = "cv",
        number = 3
    )
    train(
        x = X,
        y = Y,
        trContrl = ctrl,
        method = 'rf'
    )
}

#create decision tree function
RpartModel <- function(X, Y) {
    ctrl <- trainControl(
        method = "repeatedcv",
        number = 5
    )
    train(
        x = X,
        y = Y,
        method = 'rpart2',
        trControl = ctrl,
        tuneGrid = data.frame(maxdepth=c(2,3,4,5)),
        preProc = c('center', 'scale')
    )
}

After making a model in the form of a function, then making the model into a dataframe.

model_list <- list(rpart = RpartModel,
                   rforest = RandomForestModel) %>%
    enframe(name = 'modelName',value = 'model')
model_list
#> # A tibble: 2 x 2
#>   modelName model 
#>   <chr>     <list>
#> 1 rpart     <fn>  
#> 2 rforest   <fn>

model_list produces 2 columns, namely modelName, and model. ModelName is the name of the model, and the model contains the functions of the model.
Next the dataframe to be used replicates as many models as you want to use. The loan dataset will be replicates as much as the model used by rep () function.

nmodel <- length(model_list) #get length of model_list

nested.loan <- list(loan) %>% 
  rep(nmodel) %>% 
  enframe(name = "Id", value = "rawdata")
nested.loan
#> # A tibble: 2 x 2
#>      Id rawdata             
#>   <int> <list>              
#> 1     1 <df[,17] [990 x 17]>
#> 2     2 <df[,17] [990 x 17]>

nested.loan has 2 columns, namely Id and rawdata which contain the loan dataframe. Then rawdata will be separated into train.y which contains the default variable and train.x contains the others.

nested.loan <- nested.loan %>% 
  mutate(train.x = map(rawdata, ~select(.x, -default)),
         train.y = map(rawdata, ~.x$default))
nested.loan
#> # A tibble: 2 x 4
#>      Id rawdata              train.x              train.y    
#>   <int> <list>               <list>               <list>     
#> 1     1 <df[,17] [990 x 17]> <df[,16] [990 x 16]> <fct [990]>
#> 2     2 <df[,17] [990 x 17]> <df[,16] [990 x 16]> <fct [990]>

The next step is to join nested.loan withmodel_list using bind_cols ()

nested.loan <- nested.loan %>% 
  bind_cols(model_list)
nested.loan
#> # A tibble: 2 x 6
#>      Id rawdata              train.x              train.y     modelName model 
#>   <int> <list>               <list>               <list>      <chr>     <list>
#> 1     1 <df[,17] [990 x 17]> <df[,16] [990 x 16]> <fct [990]> rpart     <fn>  
#> 2     2 <df[,17] [990 x 17]> <df[,16] [990 x 16]> <fct [990]> rforest   <fn>

The model we have created can be used with the function invoke_map () which functions to combine functions and lists as parameters.

nested.loan <- nested.loan %>% 
  mutate(parm = map2(train.x, train.y, ~list(.x, .y)),
         model = invoke_map(model,parm))
nested.loan
#> # A tibble: 2 x 7
#>      Id rawdata           train.x           train.y    modelName model  parm    
#>   <int> <list>            <list>            <list>     <chr>     <list> <list>  
#> 1     1 <df[,17] [990 x ~ <df[,16] [990 x ~ <fct [990~ rpart     <trai~ <list [~
#> 2     2 <df[,17] [990 x ~ <df[,16] [990 x ~ <fct [990~ rforest   <trai~ <list [~

To see how well the model has been made, it can be seen from the Accuracy of each model.

nested.loan <- nested.loan %>% 
  mutate(Accuracy = map_dbl(model, ~max(.x$results$Accuracy))) %>% 
  arrange(desc(Accuracy))
nested.loan
#> # A tibble: 2 x 8
#>      Id rawdata        train.x        train.y   modelName model parm    Accuracy
#>   <int> <list>         <list>         <list>    <chr>     <lis> <list>     <dbl>
#> 1     2 <df[,17] [990~ <df[,16] [990~ <fct [99~ rforest   <tra~ <list ~    0.752
#> 2     1 <df[,17] [990~ <df[,16] [990~ <fct [99~ rpart     <tra~ <list ~    0.728

From the above results it can be seen that the random forest model produces an accuracy of 0.75 and the decission tree is 0.74. Next, we will do predict to the test data that has been made using an existing model. the data test must replicate as many models as used and then join the nested.loan data using left_join()

nested.loan <- test %>% 
  list() %>% 
  rep(nrow(nested.loan)) %>% 
  enframe(name = "Id",value = "test.x") %>% 
  left_join(nested.loan, by = "Id")
nested.loan
#> # A tibble: 2 x 9
#>      Id test.x    rawdata    train.x    train.y  modelName model parm   Accuracy
#>   <int> <list>    <list>     <list>     <list>   <chr>     <lis> <list>    <dbl>
#> 1     1 <df[,16]~ <df[,17] ~ <df[,16] ~ <fct [9~ rpart     <tra~ <list~    0.728
#> 2     2 <df[,16]~ <df[,17] ~ <df[,16] ~ <fct [9~ rforest   <tra~ <list~    0.752

Now we create a pred variable that contains results from predict

nested.loan <- nested.loan %>% 
  mutate(pred = map2(model, test.x, ~predict(.x, .y))) %>% 
  select(Id,modelName,Accuracy,pred)
glimpse(nested.loan$pred)
#> List of 2
#>  $ : Factor w/ 2 levels "no","yes": 2 1 1 2 1 1 1 1 2 2
#>  $ : Factor w/ 2 levels "no","yes": 2 2 1 2 1 1 1 1 1 1

2.2 Multiple Data Frame x Single Model

Now we will split loan data by checking_balance variable which has 4 levels namely < 0 DM, > 200 DM, 1 - 200 DM, and unknown

nested.split <- loan %>%
  group_by(checking_balance) %>% 
  nest(.key = "rawdata") %>% 
  mutate(train.x = map(rawdata, ~select(.x, -default)),
         train.y = map(rawdata, ~.x$default))
nested.split
#> # A tibble: 4 x 4
#> # Groups:   checking_balance [4]
#>   checking_balance rawdata             train.x             train.y    
#>   <fct>            <list>              <list>              <list>     
#> 1 < 0 DM           <tibble [270 x 16]> <tibble [270 x 15]> <fct [270]>
#> 2 1 - 200 DM       <tibble [268 x 16]> <tibble [268 x 15]> <fct [268]>
#> 3 unknown          <tibble [389 x 16]> <tibble [389 x 15]> <fct [389]>
#> 4 > 200 DM         <tibble [63 x 16]>  <tibble [63 x 15]>  <fct [63]>

The model that will be used is random forest

nested.rf <- nested.split %>% 
  mutate(param = map2(train.x, train.y, ~list(.x, .y)),
         model = invoke_map(RandomForestModel,param))
nested.rf
#> # A tibble: 4 x 6
#> # Groups:   checking_balance [4]
#>   checking_balance rawdata          train.x          train.y    param     model 
#>   <fct>            <list>           <list>           <list>     <list>    <list>
#> 1 < 0 DM           <tibble [270 x ~ <tibble [270 x ~ <fct [270~ <list [2~ <trai~
#> 2 1 - 200 DM       <tibble [268 x ~ <tibble [268 x ~ <fct [268~ <list [2~ <trai~
#> 3 unknown          <tibble [389 x ~ <tibble [389 x ~ <fct [389~ <list [2~ <trai~
#> 4 > 200 DM         <tibble [63 x 1~ <tibble [63 x 1~ <fct [63]> <list [2~ <trai~

To see how well the model is made, we can see the accuracy obtained from the model we made.

nested.rf <- nested.rf %>% 
  mutate(Accuracy = map_dbl(model, ~max(.x$results$Accuracy)))
nested.rf
#> # A tibble: 4 x 7
#> # Groups:   checking_balance [4]
#>   checking_balance rawdata       train.x       train.y   param    model Accuracy
#>   <fct>            <list>        <list>        <list>    <list>   <lis>    <dbl>
#> 1 < 0 DM           <tibble [270~ <tibble [270~ <fct [27~ <list [~ <tra~    0.641
#> 2 1 - 200 DM       <tibble [268~ <tibble [268~ <fct [26~ <list [~ <tra~    0.667
#> 3 unknown          <tibble [389~ <tibble [389~ <fct [38~ <list [~ <tra~    0.885
#> 4 > 200 DM         <tibble [63 ~ <tibble [63 ~ <fct [63~ <list [~ <tra~    0.761

2.3 Multiple Data Frame x Multiple Model

To run multi models against multi data, we must repeat data as much as the model that will be used. nested.split is a data loan that is divided based on thechecking_balance variable which contains 4 levels while the model used is 2, namely random forest and decision tree, the amount of data is 8 (4 X 2).

#replicates nested.split as much the model will be used
nested.multi <- nested.split %>% 
  list() %>%
  rep(nmodel) %>%
  bind_rows()

#replicates model_list as much the data frame
model.multi <- model_list %>% 
  list() %>% 
  rep(nrow(nested.split)) %>% 
  bind_rows() %>% 
  arrange(modelName)

#join nested.multi and model.multi
nested.multi <- nested.multi %>% 
  bind_cols(model.multi)
nested.multi
#> # A tibble: 8 x 6
#> # Groups:   checking_balance [4]
#>   checking_balance rawdata           train.x          train.y    modelName model
#>   <fct>            <list>            <list>           <list>     <chr>     <lis>
#> 1 < 0 DM           <tibble [270 x 1~ <tibble [270 x ~ <fct [270~ rforest   <fn> 
#> 2 1 - 200 DM       <tibble [268 x 1~ <tibble [268 x ~ <fct [268~ rforest   <fn> 
#> 3 unknown          <tibble [389 x 1~ <tibble [389 x ~ <fct [389~ rforest   <fn> 
#> 4 > 200 DM         <tibble [63 x 16~ <tibble [63 x 1~ <fct [63]> rforest   <fn> 
#> 5 < 0 DM           <tibble [270 x 1~ <tibble [270 x ~ <fct [270~ rpart     <fn> 
#> 6 1 - 200 DM       <tibble [268 x 1~ <tibble [268 x ~ <fct [268~ rpart     <fn> 
#> 7 unknown          <tibble [389 x 1~ <tibble [389 x ~ <fct [389~ rpart     <fn> 
#> 8 > 200 DM         <tibble [63 x 16~ <tibble [63 x 1~ <fct [63]> rpart     <fn>

Now we can modeling each data category with each model

nested.multi <- nested.multi %>% 
  mutate(param = map2(train.x, train.y, ~list(.x, .y)),
         model = invoke_map(model,param))
nested.multi
#> # A tibble: 8 x 7
#> # Groups:   checking_balance [4]
#>   checking_balance rawdata       train.x       train.y   modelName model param  
#>   <fct>            <list>        <list>        <list>    <chr>     <lis> <list> 
#> 1 < 0 DM           <tibble [270~ <tibble [270~ <fct [27~ rforest   <tra~ <list ~
#> 2 1 - 200 DM       <tibble [268~ <tibble [268~ <fct [26~ rforest   <tra~ <list ~
#> 3 unknown          <tibble [389~ <tibble [389~ <fct [38~ rforest   <tra~ <list ~
#> 4 > 200 DM         <tibble [63 ~ <tibble [63 ~ <fct [63~ rforest   <tra~ <list ~
#> 5 < 0 DM           <tibble [270~ <tibble [270~ <fct [27~ rpart     <tra~ <list ~
#> 6 1 - 200 DM       <tibble [268~ <tibble [268~ <fct [26~ rpart     <tra~ <list ~
#> 7 unknown          <tibble [389~ <tibble [389~ <fct [38~ rpart     <tra~ <list ~
#> 8 > 200 DM         <tibble [63 ~ <tibble [63 ~ <fct [63~ rpart     <tra~ <list ~
nested.multi$model[[1]]$finalModel
#> 
#> Call:
#>  randomForest(x = x, y = y, mtry = param$mtry, trContrl = ..1) 
#>                Type of random forest: classification
#>                      Number of trees: 500
#> No. of variables tried at each split: 8
#> 
#>         OOB estimate of  error rate: 34.44%
#> Confusion matrix:
#>     no yes class.error
#> no  90  46   0.3382353
#> yes 47  87   0.3507463

And we can see the accuracy each model

nested.multi <- nested.multi %>% 
  mutate(Accuracy = map_dbl(model, ~max(.x$results$Accuracy)))
library(lime)

model_result <- lime(x = nested.multi$train.x[[1]],nested.multi$model[[1]])
model_result$bin_continuous
#> [1] TRUE
Scroll to Top