The following presentation is produced by the team at Algoritma for its internal training. This presentation is intended for a restricted audience only. It may not be reproduced, distributed, translated, or adapted in any form outside these individuals and organizations without permission.
Outline
Why tidymodels Matters?
- Things we think we're doing right
- Things we never thought we could do better
Setting the Cross-Validation Scheme using rsample
- Rethinking: Why do we need validation?
- Tidy way to rsample-ing your dataset
Data Preprocess using recipes
- Rethinking: How should we treat train and test?
- Reproducible preprocessing with recipes
Model Fitting using parsnip
- Rethinking: How many machine learning packages have you used?
- One vegeta.. I mean package to rule them all: parsnip
Model Evaluation using yardstick
- Rethinking: How do we measure the goodness of our model?
- It's always good to bring your own yardstick
Why tidymodels Matters?
Things we think we're doing right
Sample splitting
# import libs
library(plotly)
library(randomForest)
library(ranger)
library(tidyverse)
library(tidymodels)
# import additional libs
library(lubridate)
# prepare example datasets
attrition <- read_csv("data_input/attrition.csv")
# set seed
set.seed(100)
# train rows
in_train <- sample(1:nrow(attrition), nrow(attrition) * 0.8)
# check target distribution in train
prop.table(table(attrition$attrition[in_train]))
#>
#> no yes
#> 0.8367347 0.1632653
# check target distribution in test
prop.table(table(attrition$attrition[-in_train]))
#>
#> no yes
#> 0.8469388 0.1530612
Numeric scaling
# scale age in full dataset
age_scaled <- scale(attrition$age)
# check mean and standard deviation
attr(age_scaled, "scaled:center")
#> [1] 36.92381
attr(age_scaled, "scaled:scale")
#> [1] 9.135373
# scale age in train dataset
age_train_scaled <- scale(attrition$age[in_train])
# check mean and standard deviation
attr(age_train_scaled, "scaled:center")
#> [1] 37.0085
attr(age_train_scaled, "scaled:scale")
#> [1] 9.070969
# scale age in test dataset
age_test_scaled <- scale(attrition$age[-in_train])
# check mean and standard deviation
attr(age_test_scaled, "scaled:center")
#> [1] 36.58503
attr(age_test_scaled, "scaled:scale")
#> [1] 9.396712
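The centers and scales above differ because each subset is scaled with its own statistics. A common alternative, shown here as a minimal sketch that is not part of the original material, is to reuse the train-set statistics when scaling the test set; this is essentially what recipes will automate for us later:
# reuse the train-set center and scale for the test set (illustration only)
train_center <- attr(age_train_scaled, "scaled:center")
train_scale <- attr(age_train_scaled, "scaled:scale")
age_test_rescaled <- scale(
  attrition$age[-in_train],
  center = train_center,
  scale = train_scale
)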
Things we never thought we could do better
How we see model performance
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction no yes
#> no 1036 197
#> yes 197 40
#>
#> Accuracy : 0.732
#> 95% CI : (0.7085, 0.7545)
#> No Information Rate : 0.8388
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.009
#>
#> Mcnemar's Test P-Value : 1
#>
#> Sensitivity : 0.16878
#> Specificity : 0.84023
#> Pos Pred Value : 0.16878
#> Neg Pred Value : 0.84023
#> Prevalence : 0.16122
#> Detection Rate : 0.02721
#> Detection Prevalence : 0.16122
#> Balanced Accuracy : 0.50450
#>
#> 'Positive' Class : yes
#>
How we use the Receiver Operating Characteristic curve
Setting the Cross-Validation Scheme using rsample
Rethinking: Why do we need validation?
Tidy way to rsample-ing your dataset
rsample is part of tidymodels that helps us in splitting or resampling our machine learning dataset. There are many splitting and resampling approaches provided by rsample, as you can see in its full function reference page. In this introduction, we will use the two most general functions:
- initial_split(): simple train and test splitting.
- vfold_cv(): k-fold splitting, with an optional repetition argument.
Initial splitting
Initial random sampling for splitting train and test could be done using initial_split():
# set seed
set.seed(100)
# create initial split
splitted <- initial_split(attrition, prop = 0.8)
# check train dataset
training(splitted)
#> # A tibble: 1,177 x 35
#> attrition age business_travel daily_rate department distance_from_h~
#> <chr> <dbl> <chr> <dbl> <chr> <dbl>
#> 1 yes 41 travel_rarely 1102 sales 1
#> 2 no 49 travel_frequen~ 279 research_~ 8
#> 3 yes 37 travel_rarely 1373 research_~ 2
#> 4 no 33 travel_frequen~ 1392 research_~ 3
#> 5 no 27 travel_rarely 591 research_~ 2
#> 6 no 32 travel_frequen~ 1005 research_~ 2
#> 7 no 59 travel_rarely 1324 research_~ 3
#> 8 no 30 travel_rarely 1358 research_~ 24
#> 9 no 38 travel_frequen~ 216 research_~ 23
#> 10 no 36 travel_rarely 1299 research_~ 27
#> # ... with 1,167 more rows, and 29 more variables: education <dbl>,
#> # education_field <chr>, employee_count <dbl>, employee_number <dbl>,
#> # environment_satisfaction <dbl>, gender <chr>, hourly_rate <dbl>,
#> # job_involvement <dbl>, job_level <dbl>, job_role <chr>,
#> # job_satisfaction <dbl>, marital_status <chr>, monthly_income <dbl>,
#> # monthly_rate <dbl>, num_companies_worked <dbl>, over_18 <chr>,
#> # over_time <chr>, percent_salary_hike <dbl>, performance_rating <dbl>,
#> # relationship_satisfaction <dbl>, standard_hours <dbl>,
#> # stock_option_level <dbl>, total_working_years <dbl>,
#> # training_times_last_year <dbl>, work_life_balance <dbl>,
#> # years_at_company <dbl>, years_in_current_role <dbl>,
#> # years_since_last_promotion <dbl>, years_with_curr_manager <dbl>
# check test dataset
testing(splitted)
#> # A tibble: 293 x 35
#> attrition age business_travel daily_rate department distance_from_h~
#> <chr> <dbl> <chr> <dbl> <chr> <dbl>
#> 1 no 29 travel_rarely 1389 research_~ 21
#> 2 no 38 travel_rarely 371 research_~ 2
#> 3 yes 36 travel_rarely 1218 sales 9
#> 4 yes 34 travel_rarely 699 research_~ 6
#> 5 no 53 travel_rarely 1282 research_~ 5
#> 6 yes 24 travel_rarely 813 research_~ 1
#> 7 no 36 travel_rarely 852 research_~ 5
#> 8 no 27 travel_rarely 1240 research_~ 2
#> 9 yes 41 travel_rarely 1360 research_~ 12
#> 10 yes 48 travel_rarely 626 research_~ 1
#> # ... with 283 more rows, and 29 more variables: education <dbl>,
#> # education_field <chr>, employee_count <dbl>, employee_number <dbl>,
#> # environment_satisfaction <dbl>, gender <chr>, hourly_rate <dbl>,
#> # job_involvement <dbl>, job_level <dbl>, job_role <chr>,
#> # job_satisfaction <dbl>, marital_status <chr>, monthly_income <dbl>,
#> # monthly_rate <dbl>, num_companies_worked <dbl>, over_18 <chr>,
#> # over_time <chr>, percent_salary_hike <dbl>, performance_rating <dbl>,
#> # relationship_satisfaction <dbl>, standard_hours <dbl>,
#> # stock_option_level <dbl>, total_working_years <dbl>,
#> # training_times_last_year <dbl>, work_life_balance <dbl>,
#> # years_at_company <dbl>, years_in_current_role <dbl>,
#> # years_since_last_promotion <dbl>, years_with_curr_manager <dbl>
But sometimes, simple random sampling is not enough:
# target distribution in full dataset
prop.table(table(attrition$attrition))
#>
#> no yes
#> 0.8387755 0.1612245
# target distribution in train dataset
prop.table(table(training(splitted)$attrition))
#>
#> no yes
#> 0.8462192 0.1537808
# target distribution in test dataset
prop.table(table(testing(splitted)$attrition))
#>
#> no yes
#> 0.8088737 0.1911263
This is where we need the strata argument to use stratified random sampling:
# set seed
set.seed(100)
# create stratified initial split
splitted <- initial_split(attrition, prop = 0.8, strata = "attrition")
# target distribution in full dataset
prop.table(table(attrition$attrition))
#>
#> no yes
#> 0.8387755 0.1612245
# target distribution in train dataset
prop.table(table(training(splitted)$attrition))
#>
#> no yes
#> 0.8385726 0.1614274
# target distribution in test dataset
prop.table(table(testing(splitted)$attrition))
#>
#> no yes
#> 0.8395904 0.1604096
Cross-sectional resampling
To use k-fold validation splits, optionally with repetition, we could use vfold_cv():
# set seed
set.seed(100)
# create k-fold splits with repetition
resampled <- vfold_cv(attrition, v = 3, repeats = 2, strata = "attrition")
# quick check
resampled
#> # 3-fold cross-validation repeated 2 times using stratification
#> # A tibble: 6 x 3
#> splits id id2
#> <named list> <chr> <chr>
#> 1 <split [980/490]> Repeat1 Fold1
#> 2 <split [980/490]> Repeat1 Fold2
#> 3 <split [980/490]> Repeat1 Fold3
#> 4 <split [980/490]> Repeat2 Fold1
#> 5 <split [980/490]> Repeat2 Fold2
#> 6 <split [980/490]> Repeat2 Fold3
Each train and test dataset is stored in the splits column. We can use analysis() and assessment() to get the train and test dataset, respectively:
# check train dataset on an example split
analysis(resampled$splits[[1]])
#> # A tibble: 980 x 35
#> attrition age business_travel daily_rate department distance_from_h~
#> <chr> <dbl> <chr> <dbl> <chr> <dbl>
#> 1 no 49 travel_frequen~ 279 research_~ 8
#> 2 yes 37 travel_rarely 1373 research_~ 2
#> 3 no 33 travel_frequen~ 1392 research_~ 3
#> 4 no 27 travel_rarely 591 research_~ 2
#> 5 no 59 travel_rarely 1324 research_~ 3
#> 6 no 36 travel_rarely 1299 research_~ 27
#> 7 no 29 travel_rarely 153 research_~ 15
#> 8 no 31 travel_rarely 670 research_~ 26
#> 9 no 34 travel_rarely 1346 research_~ 19
#> 10 no 22 non_travel 1123 research_~ 16
#> # ... with 970 more rows, and 29 more variables: education <dbl>,
#> # education_field <chr>, employee_count <dbl>, employee_number <dbl>,
#> # environment_satisfaction <dbl>, gender <chr>, hourly_rate <dbl>,
#> # job_involvement <dbl>, job_level <dbl>, job_role <chr>,
#> # job_satisfaction <dbl>, marital_status <chr>, monthly_income <dbl>,
#> # monthly_rate <dbl>, num_companies_worked <dbl>, over_18 <chr>,
#> # over_time <chr>, percent_salary_hike <dbl>, performance_rating <dbl>,
#> # relationship_satisfaction <dbl>, standard_hours <dbl>,
#> # stock_option_level <dbl>, total_working_years <dbl>,
#> # training_times_last_year <dbl>, work_life_balance <dbl>,
#> # years_at_company <dbl>, years_in_current_role <dbl>,
#> # years_since_last_promotion <dbl>, years_with_curr_manager <dbl>
# check test dataset on an example split
assessment(resampled$splits[[1]])
#> # A tibble: 490 x 35
#> attrition age business_travel daily_rate department distance_from_h~
#> <chr> <dbl> <chr> <dbl> <chr> <dbl>
#> 1 yes 41 travel_rarely 1102 sales 1
#> 2 no 32 travel_frequen~ 1005 research_~ 2
#> 3 no 30 travel_rarely 1358 research_~ 24
#> 4 no 38 travel_frequen~ 216 research_~ 23
#> 5 no 35 travel_rarely 809 research_~ 16
#> 6 yes 28 travel_rarely 103 research_~ 24
#> 7 no 29 travel_rarely 1389 research_~ 21
#> 8 no 32 travel_rarely 334 research_~ 5
#> 9 no 38 travel_rarely 371 research_~ 2
#> 10 yes 36 travel_rarely 1218 sales 9
#> # ... with 480 more rows, and 29 more variables: education <dbl>,
#> # education_field <chr>, employee_count <dbl>, employee_number <dbl>,
#> # environment_satisfaction <dbl>, gender <chr>, hourly_rate <dbl>,
#> # job_involvement <dbl>, job_level <dbl>, job_role <chr>,
#> # job_satisfaction <dbl>, marital_status <chr>, monthly_income <dbl>,
#> # monthly_rate <dbl>, num_companies_worked <dbl>, over_18 <chr>,
#> # over_time <chr>, percent_salary_hike <dbl>, performance_rating <dbl>,
#> # relationship_satisfaction <dbl>, standard_hours <dbl>,
#> # stock_option_level <dbl>, total_working_years <dbl>,
#> # training_times_last_year <dbl>, work_life_balance <dbl>,
#> # years_at_company <dbl>, years_in_current_role <dbl>,
#> # years_since_last_promotion <dbl>, years_with_curr_manager <dbl>
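To run the same computation on every fold, we could iterate over the splits column, for example with purrr. Below is a minimal sketch that is not part of the original material: it fits a simple logistic regression on the age and monthly_income columns purely to illustrate the analysis()/assessment() pattern on each split.
# illustration only: loop over every fold, fit a placeholder model on the
# analysis set, and measure accuracy on the assessment set
library(purrr)
fold_accuracy <- map_dbl(resampled$splits, function(split) {
  train_fold <- analysis(split)
  test_fold <- assessment(split)
  # placeholder model: logistic regression on two numeric predictors
  fit <- glm(
    factor(attrition) ~ age + monthly_income,
    data = train_fold,
    family = binomial
  )
  # predicted probability of "yes", thresholded at 0.5
  prob <- predict(fit, newdata = test_fold, type = "response")
  pred <- ifelse(prob > 0.5, "yes", "no")
  mean(pred == test_fold$attrition)
})
fold_accuracy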
Data Preprocess using recipes
Rethinking: How should we treat train and test?
Reproducible preprocessing with recipes
recipes is part of tidymodels that helps us make data preprocessing reproducible. There are many preprocessing approaches provided by recipes, as you can see in its full function reference page. In this introduction, we will use several preprocessing steps related to our example dataset.
There are several steps we could apply to our dataset. Some are very fundamental and sometimes mandatory, while others are tuneable:
- step_rm(): manually remove unused columns.
- step_nzv(): automatically filter near-zero variance columns.
- step_string2factor(): manually convert columns to factor.
- step_downsample(): downsample to balance the target's class distribution (tuneable).
- step_center(): normalize the mean of numeric column(s) to zero (tuneable).
- step_scale(): normalize the standard deviation of numeric column(s) to one (tuneable).
- step_pca(): shrink numeric column(s) into several PCA components (tuneable).
Designing your first preprocess recipe
- Initiate a recipe using recipe():
  - Define your formula in the first argument.
  - Supply a template dataset in the data argument.
- Pipe to every needed step_*(), always remembering to put every step in the proper consecutive order.
- After finishing with every needed step_*(), pipe to the prep() function to train your recipe:
  - It will automatically convert all character columns to factor; if you used step_string2factor(), don't forget to specify strings_as_factors = FALSE.
  - It will train the recipe on the dataset specified in recipe()'s data argument; if you want to train on another dataset, you can supply the new dataset to the training argument and set the fresh argument to TRUE.
Let’s see an example of defining a recipe:
# define preprocess recipe from train dataset
rec <- recipe(attrition ~ ., data = training(splitted)) %>%
step_rm(employee_count, employee_number) %>%
step_nzv(all_predictors()) %>%
step_string2factor(all_nominal(), -attrition) %>%
step_string2factor(attrition, levels = c("yes", "no")) %>%
step_downsample(attrition, ratio = 1/1, seed = 100) %>%
step_center(all_numeric()) %>%
step_scale(all_numeric()) %>%
step_pca(all_numeric(), threshold = 0.85) %>%
prep(strings_as_factors = FALSE)
# quick check
rec
#> Data Recipe
#>
#> Inputs:
#>
#> role #variables
#> outcome 1
#> predictor 34
#>
#> Training data contained 1177 data points and no missing data.
#>
#> Operations:
#>
#> Variables removed employee_count, employee_number [trained]
#> Sparse, unbalanced variable filter removed over_18, standard_hours [trained]
#> Factor variables from business_travel, ... [trained]
#> Factor variables from attrition [trained]
#> Down-sampling based on attrition [trained]
#> Centering for age, daily_rate, ... [trained]
#> Scaling for age, daily_rate, ... [trained]
#> PCA extraction with age, daily_rate, ... [trained]
There are two ways of obtaining the result from our recipe:
- juice(): extract the preprocessed dataset from a prep()-ed recipe. Normally, we will use this function to get the preprocessed train dataset.
- bake(): apply a recipe to a new dataset. Normally, we will use this function to preprocess a new dataset, such as the test dataset or a prediction dataset.
Here are some examples of how to get the preprocessed train and test datasets:
# get preprocessed train dataset
data_train <- juice(rec)
# quick check
data_train
#> # A tibble: 380 x 22
#> business_travel department education_field gender job_role
#> <fct> <fct> <fct> <fct> <fct>
#> 1 travel_rarely sales life_sciences female sales_e~
#> 2 travel_rarely research_~ other male laborat~
#> 3 travel_rarely research_~ life_sciences male laborat~
#> 4 travel_rarely research_~ medical male researc~
#> 5 travel_frequen~ research_~ life_sciences female researc~
#> 6 travel_rarely sales technical_degr~ male sales_r~
#> 7 travel_rarely research_~ medical male researc~
#> 8 travel_rarely sales marketing male sales_r~
#> 9 travel_rarely research_~ technical_degr~ female researc~
#> 10 travel_rarely research_~ life_sciences male laborat~
#> # ... with 370 more rows, and 17 more variables: marital_status <fct>,
#> # over_time <fct>, attrition <fct>, PC01 <dbl>, PC02 <dbl>, PC03 <dbl>,
#> # PC04 <dbl>, PC05 <dbl>, PC06 <dbl>, PC07 <dbl>, PC08 <dbl>,
#> # PC09 <dbl>, PC10 <dbl>, PC11 <dbl>, PC12 <dbl>, PC13 <dbl>, PC14 <dbl>
# get preprocessed test dataset
data_test <- bake(rec, testing(splitted))
# quick check
data_test
#> # A tibble: 293 x 22
#> attrition business_travel department education_field gender job_role
#> <fct> <fct> <fct> <fct> <fct> <fct>
#> 1 no travel_rarely research_~ life_sciences female manufac~
#> 2 no non_travel research_~ other female manufac~
#> 3 yes travel_rarely sales life_sciences male sales_r~
#> 4 no travel_rarely research_~ other female manager
#> 5 no travel_rarely sales marketing male sales_e~
#> 6 no travel_rarely sales marketing female sales_r~
#> 7 no travel_rarely research_~ other male laborat~
#> 8 yes travel_rarely research_~ life_sciences male laborat~
#> 9 no travel_frequen~ research_~ medical female laborat~
#> 10 no travel_rarely research_~ life_sciences male researc~
#> # ... with 283 more rows, and 16 more variables: marital_status <fct>,
#> # over_time <fct>, PC01 <dbl>, PC02 <dbl>, PC03 <dbl>, PC04 <dbl>,
#> # PC05 <dbl>, PC06 <dbl>, PC07 <dbl>, PC08 <dbl>, PC09 <dbl>,
#> # PC10 <dbl>, PC11 <dbl>, PC12 <dbl>, PC13 <dbl>, PC14 <dbl>
Model Fitting using parsnip
Rethinking: How many machine learning packages have you used?
One vegeta.. I mean package to rule them all: parsnip
parsnip is part of tidymodels that helps us with the model fitting and prediction workflow. There are many models supported by parsnip, as you can see in its full model list. In this introduction, we will use random forest as an example model.
There are two parts of defining a model that should be noted:
- Defining the model's specification: we need to define the model's specification, such as mtry and trees for random forest, through model-specific functions. For example, you can use rand_forest() to define a random forest specification. Make sure to check the full model list to see every model and its available arguments.
- Defining the model's engine: we need to define the model's engine, which determines the package we will use to fit our model. This can be done using the set_engine() function. Note that in addition to defining which package we want to use as our engine, we can also pass package-specific arguments to this function.
This is an example of defining a random forest model using randomForest::randomForest() as our engine:
# set-up model specification
model_spec <- rand_forest(
mode = "classification",
mtry = ncol(data_train) - 2,
trees = 500,
min_n = 15
)
# set-up model engine
model_engine <- set_engine(
object = model_spec,
engine = "randomForest"
)
# quick check
model_engine
#> Random Forest Model Specification (classification)
#>
#> Main Arguments:
#> mtry = ncol(data_train) - 2
#> trees = 500
#> min_n = 15
#>
#> Computational engine: randomForest
To fit our model, we have two options:
- Formula interface
- X-Y interface
Note that some packages behave differently between the formula and x-y interfaces. For example, randomForest::randomForest() would convert all of our categorical variables into dummy variables in the formula interface, but not in the x-y interface.
Fit using the formula interface with the fit() function:
# fit the model
model <- fit(
object = model_engine,
formula = attrition ~ .,
data = data_train
)
# quick check
model
#> parsnip model object
#>
#>
#> Call:
#> randomForest(x = as.data.frame(x), y = y, ntree = ~500, mtry = ~ncol(data_train) - 2, nodesize = ~15)
#> Type of random forest: classification
#> Number of trees: 500
#> No. of variables tried at each split: 20
#>
#> OOB estimate of error rate: 28.95%
#> Confusion matrix:
#> yes no class.error
#> yes 141 49 0.2578947
#> no 61 129 0.3210526
Or through the x-y interface using the fit_xy() function:
# fit the model
model <- fit_xy(
object = model_engine,
x = select(data_train, -attrition),
y = select(data_train, attrition)
)
# quick check
model
#> parsnip model object
#>
#>
#> Call:
#> randomForest(x = as.data.frame(x), y = y, ntree = ~500, mtry = ~ncol(data_train) - 2, nodesize = ~15)
#> Type of random forest: classification
#> Number of trees: 500
#> No. of variables tried at each split: 20
#>
#> OOB estimate of error rate: 27.37%
#> Confusion matrix:
#> yes no class.error
#> yes 138 52 0.2736842
#> no 52 138 0.2736842
In this workflow, it should be relatively easy to change the model engine. Let's try to fit the same model specification, but now using ranger::ranger():
# set-up other model engine
model_engine <- set_engine(
object = model_spec,
engine = "ranger",
seed = 100,
num.threads = parallel::detectCores() / 2,
importance = "impurity"
)
# quick check
model_engine
#> Random Forest Model Specification (classification)
#>
#> Main Arguments:
#> mtry = ncol(data_train) - 2
#> trees = 500
#> min_n = 15
#>
#> Engine-Specific Arguments:
#> seed = 100
#> num.threads = parallel::detectCores()/2
#> importance = impurity
#>
#> Computational engine: ranger
Now let’s try to fit the model, and see the result:
# fit the model
model <- fit(
object = model_engine,
formula = attrition ~ .,
data = data_train
)
# quick check
model
#> parsnip model object
#>
#> Ranger result
#>
#> Call:
#> ranger::ranger(formula = formula, data = data, mtry = ~ncol(data_train) - 2, num.trees = ~500, min.node.size = ~15, seed = ~100, num.threads = ~parallel::detectCores()/2, importance = ~"impurity", verbose = FALSE, probability = TRUE)
#>
#> Type: Probability estimation
#> Number of trees: 500
#> Sample size: 380
#> Number of independent variables: 21
#> Mtry: 20
#> Target node size: 15
#> Variable importance mode: impurity
#> Splitrule: gini
#> OOB prediction error (Brier s.): 0.1945066
Notice that ranger::ranger() doesn't behave differently between fit() and fit_xy():
# fit the model
model <- fit_xy(
object = model_engine,
x = select(data_train, -attrition),
y = select(data_train, attrition)
)
# quick check
model
#> parsnip model object
#>
#> Ranger result
#>
#> Call:
#> ranger::ranger(formula = formula, data = data, mtry = ~ncol(data_train) - 2, num.trees = ~500, min.node.size = ~15, seed = ~100, num.threads = ~parallel::detectCores()/2, importance = ~"impurity", verbose = FALSE, probability = TRUE)
#>
#> Type: Probability estimation
#> Number of trees: 500
#> Sample size: 380
#> Number of independent variables: 21
#> Mtry: 20
#> Target node size: 15
#> Variable importance mode: impurity
#> Splitrule: gini
#> OOB prediction error (Brier s.): 0.1945066
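As a side note that is not in the original material: since we passed importance = "impurity" to the ranger engine, the fitted ranger object, which parsnip stores in model$fit, carries impurity-based variable importances. A minimal sketch of inspecting them:
# inspect impurity-based variable importance from the underlying ranger object
model$fit$variable.importance %>%
  sort(decreasing = TRUE) %>%
  head(10)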
To get the prediction, we could use predict() as usual, but note that it returns a tidy tibble instead of a vector (in type = "class" cases) or a raw data.frame (in type = "prob" cases).
In this way, the prediction results are very convenient for further use, such as simply recombining them with the original dataset:
# get prediction on test
predicted <- data_test %>%
bind_cols(predict(model, data_test)) %>%
bind_cols(predict(model, data_test, type = "prob"))
# quick check
predicted %>%
select(attrition, matches(".pred"))
#> # A tibble: 293 x 4
#> attrition .pred_class .pred_yes .pred_no
#> <fct> <fct> <dbl> <dbl>
#> 1 no no 0.340 0.660
#> 2 no no 0.358 0.642
#> 3 yes yes 0.509 0.491
#> 4 no no 0.0884 0.912
#> 5 no no 0.185 0.815
#> 6 no no 0.448 0.552
#> 7 no no 0.127 0.873
#> 8 yes yes 0.748 0.252
#> 9 no no 0.147 0.853
#> 10 no yes 0.680 0.320
#> # ... with 283 more rows
Model Evaluation using yardstick
Rethinking: How do we measure the goodness of our model?
It’s always good to bring your own yardstick
yardstick is part of tidymodels that helps us calculate model performance metrics. There are many metrics available in yardstick, as you can see in its full function reference page. In this introduction, we will calculate some model performance metrics for a classification task as an example.
There are two ways of calculating model performance metrics, which differ in their input and output:
- tibble approach: we pass a dataset containing the truth and estimate to the function, and it returns a tibble containing the results, e.g., the precision() function.
- vector approach: we pass a vector as the truth and a vector as the estimate to the function, and it returns a vector showing the results, e.g., the precision_vec() function.
Note that some functions, like conf_mat(), only accept the tibble approach, since they do not return a vector of length one.
Let’s start by reporting the confusion matrix:
# show confusion matrix
predicted %>%
conf_mat(truth = attrition, estimate = .pred_class) %>%
autoplot(type = "heatmap")
Now, to calculate the performance metrics, let's try the tibble approach, which also supports group_by():
# calculate accuracy
predicted %>%
accuracy(attrition, .pred_class)
#> # A tibble: 1 x 3
#> .metric .estimator .estimate
#> <chr> <chr> <dbl>
#> 1 accuracy binary 0.710
# calculate accuracy by group
predicted %>%
group_by(department) %>%
accuracy(attrition, .pred_class) %>%
ungroup()
#> # A tibble: 3 x 4
#> department .metric .estimator .estimate
#> <fct> <chr> <chr> <dbl>
#> 1 human_resources accuracy binary 0.636
#> 2 research_development accuracy binary 0.760
#> 3 sales accuracy binary 0.605
Or use the vector approach, which is more flexible in general:
# metrics summary
predicted %>%
summarise(
accuracy = accuracy_vec(attrition, .pred_class),
sensitivity = sens_vec(attrition, .pred_class),
specificity = spec_vec(attrition, .pred_class),
precision = precision_vec(attrition, .pred_class)
)
#> # A tibble: 1 x 4
#> accuracy sensitivity specificity precision
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.710 0.723 0.707 0.321
# metrics summary by group
predicted %>%
group_by(department) %>%
summarise(
accuracy = accuracy_vec(attrition, .pred_class),
sensitivity = sens_vec(attrition, .pred_class),
specificity = spec_vec(attrition, .pred_class),
precision = precision_vec(attrition, .pred_class)
) %>%
ungroup()
#> # A tibble: 3 x 5
#> department accuracy sensitivity specificity precision
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 human_resources 0.636 1 0.556 0.333
#> 2 research_development 0.760 0.677 0.776 0.362
#> 3 sales 0.605 0.786 0.569 0.262
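As a possible shortcut not shown above, yardstick also provides metric_set(), which bundles several metrics into a single function that works with the tibble approach, including group_by(). A minimal sketch:
# bundle several metrics into one function
multi_metrics <- metric_set(accuracy, sens, spec, precision)
# metrics summary by group in a single call
predicted %>%
  group_by(department) %>%
  multi_metrics(truth = attrition, estimate = .pred_class) %>%
  ungroup()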
Sometimes model performance metrics can also help improve the model's final results. For example, through the Receiver Operating Characteristic curve, we can assess the effect of the probability threshold on sensitivity and specificity:
predicted %>%
roc_curve(attrition, .pred_yes) %>%
autoplot()
And, since it returns a tibble, we could do further data wrangling to see it more clearly:
# get roc curve data on test dataset
pred_test_roc <- predicted %>%
roc_curve(attrition, .pred_yes)
# quick check
pred_test_roc
#> # A tibble: 295 x 3
#> .threshold specificity sensitivity
#> <dbl> <dbl> <dbl>
#> 1 -Inf 0 1
#> 2 0.0422 0 1
#> 3 0.0487 0.00407 1
#> 4 0.0568 0.00813 1
#> 5 0.0674 0.0122 1
#> 6 0.0685 0.0163 1
#> 7 0.0705 0.0203 1
#> 8 0.0884 0.0244 1
#> 9 0.0922 0.0285 1
#> 10 0.0945 0.0325 1
#> # ... with 285 more rows
With some ggplot2:
# tidying
pred_test_roc <- pred_test_roc %>%
mutate_if(~ is.numeric(.), ~ round(., 4)) %>%
gather(metric, value, -.threshold)
# plot sensitivity-specificity trade-off
p <- ggplot(pred_test_roc, aes(x = .threshold, y = value)) +
geom_line(aes(colour = metric)) +
labs(x = "Probability Threshold to be Classified as Positive", y = "Value", colour = "Metrics") +
theme_minimal()
and plotly magic, it would be perfect:
# convert to plotly
ggplotly(p)
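If a single-number summary of the curve is also needed, yardstick provides roc_auc(). A minimal sketch, not part of the original material, using the same event probability column as roc_curve() above:
# area under the ROC curve on the test dataset
predicted %>%
  roc_auc(attrition, .pred_yes)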