About Dimensionality Reduction
When you work with data, you sometimes face so many predictor variables that computation becomes heavy. Suppose you are asked to predict whether an employee in your company will resign, and the variables are the level of job satisfaction, number of projects, average monthly hours, time spent at the company, and so on. With that many predictors, training your model takes a long time. One way to speed up training is to reduce the dimension of the data, which makes the computation lighter.
Dimensionality reduction techniques fall into two groups:
- Feature Elimination
- Feature Extraction
Feature Elimination
Feature elimination means keeping the variables that influence your prediction and discarding the variables that contribute nothing to it. In the employee resignation case, for example, you keep only the variables that influence whether an employee resigns.
Generally, you choose the variables based on your domain expertise about employee resignation. You can also use statistical techniques for this, such as variance, Spearman correlation, ANOVA, etc. This article does not cover feature elimination methods further, since we want to focus on one of the feature extraction methods.
Feature Extraction
Feature extraction is a technique in which you create new variables from your existing variables. Suppose that, for the employee resignation case, we have 10 predictor variables. In feature extraction, we create 10 new variables, each built from the 10 given variables, and then keep only the most informative ones. One of the techniques for doing this is called Principal Component Analysis (PCA).
Principal Component Analysis
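Principal Component Analysis (PCA) is a statistical method that reduces the dimension of the data by combining the original variables into new variables (principal components), ordered by how much of the data's variance they capture, and then dropping the components that carry the least information. Note that PCA looks only at the predictors; it does not use the target that we predict, \(\hat{y}\), when deciding what to keep.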
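As a minimal sketch of this idea, the base R function prcomp() can be run on the built-in USArrests data. This dataset is only an illustrative stand-in and is not one of the datasets used in this article. The result contains as many principal components as there are input variables, ordered by the share of variance each one explains.

```r
# PCA on a small built-in dataset (illustrative stand-in, not the article's data)
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# one principal component is produced per original variable
n_components <- ncol(pca$x)

# share of the total variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

print(n_components)
print(round(var_explained, 3))
```

Dimensionality reduction then amounts to keeping only the first few columns of pca$x.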
So, when should you use PCA instead of other methods? [1]
- When you want to reduce the number of dimensions/variables, but you do not care which variables are removed
- When you want to ensure your variables are not correlated with one another
- When you are comfortable with making your predictor variables less interpretable
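The second point in the list can be checked directly. The sketch below uses four columns of the built-in mtcars data as an illustrative stand-in; max_offdiag() is a hypothetical helper written for this example. The original columns are strongly correlated, while the principal component scores are uncorrelated up to floating-point error.

```r
# four correlated numeric columns from a built-in dataset (illustrative stand-in)
x <- mtcars[, c("mpg", "disp", "hp", "wt")]

# hypothetical helper: largest absolute off-diagonal entry of a correlation matrix
max_offdiag <- function(m) max(abs(m[upper.tri(m)]))

pca <- prcomp(x, center = TRUE, scale. = TRUE)

cor_original <- max_offdiag(cor(x))      # correlation among the original variables
cor_scores   <- max_offdiag(cor(pca$x))  # correlation among the PC scores

print(cor_original)
print(cor_scores)
```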
In this article, we apply Principal Component Analysis to two datasets, the Online Shoppers Intention dataset and the Breast Cancer dataset. The aim is to compare how well PCA performs on data whose variables are weakly correlated with one another versus data whose variables are highly correlated. Let us start with the Online Shoppers Intention dataset.
Applying PCA on Online Shopper Intention Dataset
We will explore PCA on data whose variables are weakly correlated and on data whose variables are strongly correlated. We start with the weakly correlated data.
In this use case, we use the Online Shoppers Intention dataset, downloaded from Kaggle. The data consists of various information related to customer behavior on online shopping websites. Suppose we want to predict whether a customer will generate revenue for our business.
We will create two models here: the first uses PCA on the predictors, and the second uses the same preprocessing without PCA.
Load the libraries needed.
# data wrangling
library(tidyverse)
library(GGally)
# data preprocessing
library(recipes)
# modelling
library(rsample)
library(caret)
# measure time consumption
library(tictoc)
Load the shopper intention dataset to our environment.
shopper_intention <- read_csv("pca_use_case/online_shoppers_intention.csv")
The data looks like this:
glimpse(shopper_intention)
#> Rows: 12,330
#> Columns: 18
#> $ Administrative <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0...
#> $ Administrative_Duration <dbl> 0, 0, -1, 0, 0, 0, -1, -1, 0, 0, 0, 0, 0, 0...
#> $ Informational <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
#> $ Informational_Duration <dbl> 0, 0, -1, 0, 0, 0, -1, -1, 0, 0, 0, 0, 0, 0...
#> $ ProductRelated <dbl> 1, 2, 1, 2, 10, 19, 1, 1, 2, 3, 3, 16, 7, 6...
#> $ ProductRelated_Duration <dbl> 0.000000, 64.000000, -1.000000, 2.666667, 6...
#> $ BounceRates <dbl> 0.200000000, 0.000000000, 0.200000000, 0.05...
#> $ ExitRates <dbl> 0.200000000, 0.100000000, 0.200000000, 0.14...
#> $ PageValues <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
#> $ SpecialDay <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.4, 0.0, 0.8...
#> $ Month <chr> "Feb", "Feb", "Feb", "Feb", "Feb", "Feb", "...
#> $ OperatingSystems <fct> 1, 2, 4, 3, 3, 2, 2, 1, 2, 2, 1, 1, 1, 2, 3...
#> $ Browser <fct> 1, 2, 1, 2, 3, 2, 4, 2, 2, 4, 1, 1, 1, 5, 2...
#> $ Region <fct> 1, 1, 9, 2, 1, 1, 3, 1, 2, 1, 3, 4, 1, 1, 3...
#> $ TrafficType <dbl> 1, 2, 3, 4, 4, 3, 3, 5, 3, 2, 3, 3, 3, 3, 3...
#> $ VisitorType <chr> "Returning_Visitor", "Returning_Visitor", "...
#> $ Weekend <fct> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FA...
#> $ Revenue <fct> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
The dataset has 12,330 observations and 18 variables, so we have 17 predictor variables and 1 target variable to predict. Here is the description of the variables in the dataset:
- Administrative = Administrative Value
- Administrative_Duration = Duration in Administrative Page
- Informational = Informational Value
- Informational_Duration = Duration in Informational Page
- ProductRelated = Product Related Value
- ProductRelated_Duration = Duration in Product Related Page
- BounceRates = Percentage of visitors who enter the site from that page and then leave (“bounce”) without triggering any other requests to the analytics server during that session
- ExitRates = Exit rate of a web page
- PageValues = Page values of each web page
- SpecialDay = Special days like valentine etc
- Month = Month of the year
- OperatingSystems = Operating system used
- Browser = Browser used
- Region = Region of the user
- TrafficType = Traffic Type
- VisitorType = Types of Visitor
- Weekend = Weekend or not
- Revenue = Revenue will be generated or not
Based on the descriptions, our variables appear to be in the correct data types. Next, we want to check the correlation between the numerical predictor variables, visualized with the ggcorr() function from the GGally package.
ggcorr(select_if(shopper_intention, is.numeric),
       label = TRUE,
       hjust = 1,
       layout.exp = 3)
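The correlation plot can be complemented with a numeric summary. The sketch below defines a hypothetical helper, top_correlations(), that lists variable pairs ordered by absolute correlation using only base R. It is demonstrated on the built-in iris data as a stand-in; the same call should work on shopper_intention, assuming it is loaded as shown earlier.

```r
# hypothetical helper: variable pairs ordered by absolute correlation
top_correlations <- function(df) {
  num <- df[, sapply(df, is.numeric), drop = FALSE]   # keep numeric columns only
  cm  <- cor(num, use = "pairwise.complete.obs")      # tolerate missing values
  idx <- which(upper.tri(cm), arr.ind = TRUE)         # each pair exactly once
  out <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    cor  = cm[idx],
    stringsAsFactors = FALSE
  )
  out[order(-abs(out$cor)), ]
}

pairs <- top_correlations(iris)
print(head(pairs, 3))
```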
It looks like several variables are correlated with one another, but the correlations are not very high. Now, let us split the data into train and test sets. We use 80% of the data as our training dataset and 20% as our testing dataset.
RNGkind(sample.kind = "Rounding")
set.seed(417)
splitted <- initial_split(data = shopper_intention, prop = 0.8, strata = "Revenue")
Now, let us check the proportion of our target variable, Revenue, in the train dataset.
prop.table(table(training(splitted)$Revenue))
#>
#> FALSE TRUE
#> 0.8452103 0.1547897
Based on the proportion of our target variable, only about 15.5% of the visitors to the website purchase any goods and thus generate revenue for the shop. In other words, our target variable is imbalanced.
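To illustrate what a stratified split is meant to achieve, here is a minimal base R sketch. The helper stratified_split() is hypothetical and is not how rsample implements strata internally; it simply samples the same fraction from each class so that the class proportions are preserved in the training set. The toy target vector below is made up for the example.

```r
# hypothetical helper: sample the same fraction of rows from every class
stratified_split <- function(y, prop = 0.8) {
  idx_by_class <- split(seq_along(y), y)
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = floor(prop * length(idx)))]
  }), use.names = FALSE)
  sort(train_idx)
}

set.seed(417)
# toy imbalanced target: 850 FALSE and 150 TRUE (made-up numbers)
y <- factor(rep(c("FALSE", "TRUE"), times = c(850, 150)))

train_idx  <- stratified_split(y, prop = 0.8)
train_prop <- prop.table(table(y[train_idx]))
print(train_prop)
```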
Then, let us check whether there are any missing values in each variable.
colSums(is.na(shopper_intention))
#> Administrative Administrative_Duration Informational
#> 14 14 14
#> Informational_Duration ProductRelated ProductRelated_Duration
#> 14 14 14
#> BounceRates ExitRates PageValues
#> 14 14 0
#> SpecialDay Month OperatingSystems
#> 0 0 0
#> Browser Region TrafficType
#> 0 0 0
#> VisitorType Weekend Revenue
#> 0 0 0
Based on the output above, our data has several missing values (NA), but only 14 per affected column, which is far below 5% of our 12,330 observations. Hence, we can remove the rows with NA in our preprocessing step.
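The share of missing values can also be computed directly rather than read off the counts. The sketch below uses a made-up toy data frame; colMeans(is.na(...)) gives the fraction of NA per column, and complete.cases() keeps only the fully observed rows, which is roughly what removing NA rows in preprocessing does.

```r
# toy data frame with a few missing values (made up for illustration)
toy <- data.frame(
  clicks   = c(3, NA, 5, 2, 8, 1, NA, 4, 6, 7),
  duration = c(10, 12, NA, 9, 30, 4, 11, 15, 22, 18),
  bought   = c(FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE)
)

# fraction of missing values per column
na_share <- colMeans(is.na(toy))

# keep only rows without any missing value
toy_complete <- toy[complete.cases(toy), ]

# share of missing values per affected column in the article's data
article_share <- 14 / 12330

print(na_share)
print(nrow(toy_complete))
print(article_share)
```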
The Revenue on Online Website Prediction with PCA
In this article, we do several preprocessing steps using the recipe() function from the recipes package. We define all of our preprocessing in step_*() functions, including the PCA step. The PCA step in our recipe is written as step_pca(all_numeric(), threshold = 0.90). This means we apply PCA to the numeric variables only and keep enough components to cover 90% of the cumulative variance of the data, hence the threshold of 0.90.
# note: in newer versions of recipes, step_upsample() is provided by the themis package
rec <- recipe(Revenue ~ ., training(splitted)) %>%
  step_naomit(all_predictors()) %>%                  # remove observations that have NA (missing values)
  step_nzv(all_predictors()) %>%                     # remove near-zero-variance variables
  step_upsample(Revenue, ratio = 1, seed = 100) %>%  # balance the target variable proportion
  step_center(all_numeric()) %>%                     # give all numeric predictors a mean of 0
  step_scale(all_numeric()) %>%                      # give all numeric predictors a standard deviation of 1
  step_pca(all_numeric(), threshold = 0.90) %>%      # keep the PCs that cover 90% of the variance
  prep()                                             # prepare the recipe
train <- juice(rec)
test <- bake(rec, testing(splitted))
Now, let us peek at our train dataset after the preprocessing is applied.
head(train)
#> # A tibble: 6 x 13
#> Month OperatingSystems Browser Region VisitorType Weekend Revenue PC1
#> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <dbl>
#> 1 Feb 1 1 1 Returning_~ FALSE FALSE -3.80
#> 2 Feb 2 2 1 Returning_~ FALSE FALSE -1.70
#> 3 Feb 3 3 1 Returning_~ TRUE FALSE -1.30
#> 4 Feb 2 2 1 Returning_~ FALSE FALSE -1.09
#> 5 Feb 2 4 3 Returning_~ FALSE FALSE -3.83
#> 6 Feb 1 2 1 Returning_~ TRUE FALSE -3.74
#> # ... with 5 more variables: PC2 <dbl>, PC3 <dbl>, PC4 <dbl>, PC5 <dbl>,
#> # PC6 <dbl>
As we can see in the train dataset above, we have 1 target variable, 6 categorical predictors, and 6 new numeric PCs (the components that cover 90% of the variance) that will be used to train our model.
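The idea behind the variance threshold can be sketched in base R. The helper n_components_for() below is hypothetical and only approximates what a threshold-based PCA step does: it returns the smallest number of components whose cumulative share of variance reaches the threshold. It is shown on the built-in iris measurements as a stand-in for the article's data.

```r
# hypothetical helper: smallest number of PCs reaching a cumulative variance threshold
n_components_for <- function(x, threshold = 0.90) {
  pca     <- prcomp(x, center = TRUE, scale. = TRUE)
  cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)   # cumulative share of variance
  which(cum_var >= threshold)[1]                    # first component that reaches it
}

k <- n_components_for(iris[, 1:4], threshold = 0.90)
print(k)
```

A higher threshold keeps more components and more information, while a lower one gives a stronger reduction.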
In our first model, the one that uses PCA in the preprocessing step, we build a random forest model using 5-fold cross-validation with 3 repeats to predict whether a visitor to our website will generate revenue. We also use the tic() and toc() functions to measure the time elapsed while training the random forest model.
RNGkind(sample.kind = "Rounding")
set.seed(100)
tic()
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
model <- train(Revenue ~ ., data = train, method = "rf", trControl = ctrl)
toc()
After running the model, the time consumed to build the model is 1608.41 seconds, or around 27 minutes.
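If tictoc is not available, base R's system.time() gives a comparable measurement. The sketch below times a small PCA on simulated data, chosen only to keep the example quick, and converts the article's reported training time from seconds to minutes. The helper seconds_to_minutes() is hypothetical.

```r
# time an arbitrary block of code with base R (simulated data, illustration only)
elapsed <- system.time({
  x   <- matrix(rnorm(2e5), ncol = 20)
  fit <- prcomp(x)
})[["elapsed"]]

# hypothetical helper: convert seconds to minutes
seconds_to_minutes <- function(s) s / 60

# the training time reported above, in minutes
reported_minutes <- seconds_to_minutes(1608.41)

print(elapsed)
print(round(reported_minutes, 1))
```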
Then, we use the model to predict the test dataset.
prediction_pca <- predict(model, test)
Now, let us check the accuracy of the model with a confusion matrix.
confusionMatrix(prediction_pca, test$Revenue, positive = "TRUE")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction FALSE TRUE
#> FALSE 1954 170
#> TRUE 128 211
#>
#> Accuracy : 0.879
#> 95% CI : (0.8655, 0.8916)
#> No Information Rate : 0.8453
#> P-Value [Acc > NIR] : 1.064e-06
#>
#> Kappa : 0.5155
#>
#> Mcnemar's Test P-Value : 0.01755
#>
#> Sensitivity : 0.55381
#> Specificity : 0.93852
#> Pos Pred Value : 0.62242
#> Neg Pred Value : 0.91996
#> Prevalence : 0.15469
#> Detection Rate : 0.08567
#> Detection Prevalence : 0.13764
#> Balanced Accuracy : 0.74616
#>
#> 'Positive' Class : TRUE
#>
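The headline statistics in this output can be recomputed by hand from the four cell counts, which makes it clearer what each one measures. The sketch below rebuilds the confusion matrix shown above, with "TRUE" as the positive class, and derives accuracy, sensitivity, specificity, and balanced accuracy.

```r
# confusion matrix from the output above (rows = prediction, columns = reference)
cm <- matrix(c(1954, 128, 170, 211), nrow = 2,
             dimnames = list(Prediction = c("FALSE", "TRUE"),
                             Reference  = c("FALSE", "TRUE")))

tp <- cm["TRUE", "TRUE"]    # predicted TRUE, actually TRUE
tn <- cm["FALSE", "FALSE"]  # predicted FALSE, actually FALSE
fp <- cm["TRUE", "FALSE"]   # predicted TRUE, actually FALSE
fn <- cm["FALSE", "TRUE"]   # predicted FALSE, actually TRUE

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)   # share of actual buyers that were detected
specificity <- tn / (tn + fp)   # share of non-buyers correctly identified
balanced    <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, balanced = balanced), 4))
```

With an imbalanced target like Revenue, sensitivity and balanced accuracy are more informative than accuracy alone.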
The Revenue on Online Website Prediction without PCA
Now, we want to compare the model that uses PCA in the preprocessing step with a model that uses the same preprocessing steps but without PCA. Let us make the recipe first.
rec2 <- recipe(Revenue ~ ., training(splitted)) %>%
  step_naomit(all_predictors()) %>%
  step_nzv(all_predictors()) %>%
  step_upsample(Revenue, ratio = 1, seed = 100) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric()) %>%
  prep()
train2 <- juice(rec2)
test2 <- bake(rec2, testing(splitted))
Then, take a look at our training data.
head(train2)
#> # A tibble: 6 x 17
#> Administrative Administrative_~ Informational Informational_D~ ProductRelated
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 -0.787 -0.532 -0.452 -0.287 -0.725
#> 2 -0.787 -0.532 -0.452 -0.287 -0.706
#> 3 -0.787 -0.532 -0.452 -0.287 -0.551
#> 4 -0.787 -0.532 -0.452 -0.287 -0.378
#> 5 -0.787 -0.538 -0.452 -0.294 -0.725
#> 6 -0.500 -0.538 -0.452 -0.294 -0.725
#> # ... with 12 more variables: ProductRelated_Duration <dbl>, BounceRates <dbl>,
#> # ExitRates <dbl>, PageValues <dbl>, Month <fct>, OperatingSystems <fct>,
#> # Browser <fct>, Region <fct>, TrafficType <dbl>, VisitorType <fct>,
#> # Weekend <fct>, Revenue <fct>
As seen above, we use 16 predictors: only SpecialDay was dropped by the near-zero-variance filter, and the numeric predictors are kept as they are instead of being replaced by principal components (unlike in our previous model). Next, we apply the random forest algorithm with exactly the same settings to compare the training time and the accuracy of the model.
RNGkind(sample.kind = "Rounding")
set.seed(100)
tic()
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
model2 <- train(Revenue ~ ., data = train2, method = "rf", trControl = ctrl)
toc()
prediction <- predict(model2, test2)
confusionMatrix(prediction, test2$Revenue, positive = "TRUE")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction FALSE TRUE
#> FALSE 1949 143
#> TRUE 133 238
#>
#> Accuracy : 0.8879
#> 95% CI : (0.8748, 0.9001)
#> No Information Rate : 0.8453
#> P-Value [Acc > NIR] : 6.578e-10
#>
#> Kappa : 0.5669
#>
#> Mcnemar's Test P-Value : 0.588
#>
#> Sensitivity : 0.62467
#> Specificity : 0.93612
#> Pos Pred Value : 0.64151
#> Neg Pred Value : 0.93164
#> Prevalence : 0.15469
#> Detection Rate : 0.09663
#> Detection Prevalence : 0.15063
#> Balanced Accuracy : 0.78040
#>
#> 'Positive' Class : TRUE
#>
Result:
- The online shopper data has only a few variables that are correlated with one another.
- The two models above (with and without PCA) have similar accuracy (0.879 with PCA, 0.888 without PCA).
- The training time with PCA is 1608.41 seconds and without PCA is 1936.95 seconds, so PCA saves 328.54 seconds, or about 5.5 minutes.
Now, what if we have more numeric predictors and stronger correlation?
Applying PCA in Breast Cancer Dataset
In this section, we use the breast cancer dataset. Suppose we want to predict whether a patient is diagnosed with malignant or benign cancer. The predictor variables are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. The data itself can be downloaded from the UCI Machine Learning Repository.
Here, we again create two models: the first uses PCA on the predictors, and the second uses the same preprocessing without PCA.
cancer <- read_csv("pca_use_case/breast-cancer-wisconsin-data/data.csv")
Now, let us take a look at our data.
glimpse(cancer)
#> Rows: 569
#> Columns: 33
#> $ id <dbl> 842302, 842517, 84300903, 84348301, 8435840...
#> $ diagnosis <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M"...
#> $ radius_mean <dbl> 17.990, 20.570, 19.690, 11.420, 20.290, 12....
#> $ texture_mean <dbl> 10.38, 17.77, 21.25, 20.38, 14.34, 15.70, 1...
#> $ perimeter_mean <dbl> 122.80, 132.90, 130.00, 77.58, 135.10, 82.5...
#> $ area_mean <dbl> 1001.0, 1326.0, 1203.0, 386.1, 1297.0, 477....
#> $ smoothness_mean <dbl> 0.11840, 0.08474, 0.10960, 0.14250, 0.10030...
#> $ compactness_mean <dbl> 0.27760, 0.07864, 0.15990, 0.28390, 0.13280...
#> $ concavity_mean <dbl> 0.30010, 0.08690, 0.19740, 0.24140, 0.19800...
#> $ `concave points_mean` <dbl> 0.14710, 0.07017, 0.12790, 0.10520, 0.10430...
#> $ symmetry_mean <dbl> 0.2419, 0.1812, 0.2069, 0.2597, 0.1809, 0.2...
#> $ fractal_dimension_mean <dbl> 0.07871, 0.05667, 0.05999, 0.09744, 0.05883...
#> $ radius_se <dbl> 1.0950, 0.5435, 0.7456, 0.4956, 0.7572, 0.3...
#> $ texture_se <dbl> 0.9053, 0.7339, 0.7869, 1.1560, 0.7813, 0.8...
#> $ perimeter_se <dbl> 8.589, 3.398, 4.585, 3.445, 5.438, 2.217, 3...
#> $ area_se <dbl> 153.40, 74.08, 94.03, 27.23, 94.44, 27.19, ...
#> $ smoothness_se <dbl> 0.006399, 0.005225, 0.006150, 0.009110, 0.0...
#> $ compactness_se <dbl> 0.049040, 0.013080, 0.040060, 0.074580, 0.0...
#> $ concavity_se <dbl> 0.05373, 0.01860, 0.03832, 0.05661, 0.05688...
#> $ `concave points_se` <dbl> 0.015870, 0.013400, 0.020580, 0.018670, 0.0...
#> $ symmetry_se <dbl> 0.03003, 0.01389, 0.02250, 0.05963, 0.01756...
#> $ fractal_dimension_se <dbl> 0.006193, 0.003532, 0.004571, 0.009208, 0.0...
#> $ radius_worst <dbl> 25.38, 24.99, 23.57, 14.91, 22.54, 15.47, 2...
#> $ texture_worst <dbl> 17.33, 23.41, 25.53, 26.50, 16.67, 23.75, 2...
#> $ perimeter_worst <dbl> 184.60, 158.80, 152.50, 98.87, 152.20, 103....
#> $ area_worst <dbl> 2019.0, 1956.0, 1709.0, 567.7, 1575.0, 741....
#> $ smoothness_worst <dbl> 0.1622, 0.1238, 0.1444, 0.2098, 0.1374, 0.1...
#> $ compactness_worst <dbl> 0.6656, 0.1866, 0.4245, 0.8663, 0.2050, 0.5...
#> $ concavity_worst <dbl> 0.71190, 0.24160, 0.45040, 0.68690, 0.40000...
#> $ `concave points_worst` <dbl> 0.26540, 0.18600, 0.24300, 0.25750, 0.16250...
#> $ symmetry_worst <dbl> 0.4601, 0.2750, 0.3613, 0.6638, 0.2364, 0.3...
#> $ fractal_dimension_worst <dbl> 0.11890, 0.08902, 0.08758, 0.17300, 0.07678...
#> $ X33 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
The dataset has 569 observations and 33 columns: an ID, the response variable diagnosis, 30 numeric features, and an empty column X33. The variables are described below:
- ID = ID number
- diagnosis = (M = malignant, B = benign)
Ten real-valued features are computed for each cell nucleus:
- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter^2 / area – 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry
- fractal dimension (“coastline approximation” – 1)
The mean, standard error and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
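The naming scheme described here can be reproduced with a short sketch. It assumes the column names follow the pattern feature_statistic seen in the glimpse() output above, and it uses the field numbering from the description, where fields 1 and 2 are the ID and the diagnosis.

```r
# the ten base features and the three statistics computed for each of them
features <- c("radius", "texture", "perimeter", "area", "smoothness",
              "compactness", "concavity", "concave points", "symmetry",
              "fractal_dimension")
stats <- c("mean", "se", "worst")

# all 30 combinations, ordered statistic by statistic
feature_names <- as.vector(outer(features, stats, paste, sep = "_"))

# fields 1 and 2 are id and diagnosis, so features start at field 3
field_number <- seq_along(feature_names) + 2
names(field_number) <- feature_names

print(length(feature_names))
print(field_number[c("radius_mean", "radius_se", "radius_worst")])
```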
From the data, the id and X33 variables do not help us predict the diagnosis of a cancer patient. Let us remove them from the data.
cancer <- cancer %>%
  select(-c(X33, id))
Then, let us check whether there are any missing values in each variable.
colSums(is.na(cancer))
#> diagnosis radius_mean texture_mean
#> 0 0 0
#> perimeter_mean area_mean smoothness_mean
#> 0 0 0
#> compactness_mean concavity_mean concave points_mean
#> 0 0 0
#> symmetry_mean fractal_dimension_mean radius_se
#> 0 0 0
#> texture_se perimeter_se area_se
#> 0 0 0
#> smoothness_se compactness_se concavity_se
#> 0 0 0
#> concave points_se symmetry_se fractal_dimension_se
#> 0 0 0
#> radius_worst texture_worst perimeter_worst
#> 0 0 0
#> area_worst smoothness_worst compactness_worst
#> 0 0 0
#> concavity_worst concave points_worst symmetry_worst
#> 0 0 0
#> fractal_dimension_worst
#> 0
Now, let us check the correlation between the variables to confirm that they are more highly correlated with one another than those in the online shopper data.
ggcorr(cancer, label = TRUE, hjust = 1, label_size = 2, layout.exp = 6)
From the visualization above, the variables in this data are more highly correlated with one another than those in the online shopper data.
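Why correlation matters for PCA can be illustrated with a small simulation. This is a sketch on made-up data, not on the article's datasets. One matrix has 10 independent columns, and the other has 10 columns that all share a common signal plus a little noise. The hypothetical helper n_pcs() counts how many components are needed to cover 90% of the variance.

```r
set.seed(100)
n <- 500   # number of observations
p <- 10    # number of variables

# hypothetical helper: number of PCs needed to reach a cumulative variance threshold
n_pcs <- function(x, threshold = 0.90) {
  sdev <- prcomp(x, center = TRUE, scale. = TRUE)$sdev
  which(cumsum(sdev^2) / sum(sdev^2) >= threshold)[1]
}

# ten independent columns: no shared structure to compress
independent <- matrix(rnorm(n * p), nrow = n)

# ten columns built from one shared signal plus small noise: highly correlated
shared     <- rnorm(n)
correlated <- shared + 0.2 * matrix(rnorm(n * p), nrow = n)

pcs_independent <- n_pcs(independent)
pcs_correlated  <- n_pcs(correlated)

print(c(independent = pcs_independent, correlated = pcs_correlated))
```

The correlated data needs far fewer components, which is the situation where PCA removes the most dimensions for the same threshold.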
RNGkind(sample.kind = "Rounding")
set.seed(100)
idx <- initial_split(cancer, prop = 0.8,strata = "diagnosis")
cancer_train <- training(idx)
cancer_test <- testing(idx)
The Breast Cancer Prediction with PCA
Using the breast cancer dataset, we first build a model with PCA in the preprocessing step. Again, we keep the components that cover 90% of the variance of the data.
rec_cancer_pca <- recipe(diagnosis ~ ., cancer_train) %>%
  step_naomit(all_predictors()) %>%
  step_nzv(all_predictors()) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric()) %>%
  step_pca(all_numeric(), threshold = 0.9) %>%
  prep()
cancer_train_pca <- juice(rec_cancer_pca)
cancer_test_pca <- bake(rec_cancer_pca, cancer_test)
After applying PCA to the breast cancer dataset, here are the variables that we will be using.
head(cancer_train_pca)
#> # A tibble: 6 x 8
#> diagnosis PC1 PC2 PC3 PC4 PC5 PC6 PC7
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 M -9.15 -1.58 0.900 -3.89 -0.655 1.33 -2.06
#> 2 M -2.33 3.98 0.528 -1.04 0.584 -0.0925 -0.104
#> 3 M -7.21 -10.1 3.13 -0.868 -2.31 2.92 -1.35
#> 4 M -3.91 2.22 -1.51 -2.72 0.833 -1.28 0.829
#> 5 M -2.18 2.86 1.66 -0.242 -0.108 -0.196 0.194
#> 6 M -3.16 -3.26 3.06 0.153 -1.55 0.542 0.213
From the table above, we use 7 PCs instead of 30 predictor variables. Now let us train the model on the data.
RNGkind(sample.kind = "Rounding")
set.seed(100)
tic()
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
model_cancer_pca <- train(diagnosis ~ ., data = cancer_train_pca, method = "rf", trControl = ctrl)
toc()
Training with PCA took 4.88 seconds. Next, we can predict the test dataset using model_cancer_pca.
pred_cancer_pca <- predict(model_cancer_pca, cancer_test_pca)
Now, let us check the performance of our model using a confusion matrix.
confusionMatrix(pred_cancer_pca, cancer_test_pca$diagnosis, positive = "M")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction B M
#> B 70 3
#> M 1 39
#>
#> Accuracy : 0.9646
#> 95% CI : (0.9118, 0.9903)
#> No Information Rate : 0.6283
#> P-Value [Acc > NIR] : <2e-16
#>
#> Kappa : 0.9235
#>
#> Mcnemar's Test P-Value : 0.6171
#>
#> Sensitivity : 0.9286
#> Specificity : 0.9859
#> Pos Pred Value : 0.9750
#> Neg Pred Value : 0.9589
#> Prevalence : 0.3717
#> Detection Rate : 0.3451
#> Detection Prevalence : 0.3540
#> Balanced Accuracy : 0.9572
#>
#> 'Positive' Class : M
#>
The accuracy of the model on the test data while using PCA is 0.965. Next, we build a model without PCA to compare against.
The Breast Cancer Prediction without PCA
In this part, we want to classify the breast cancer patient diagnosis without PCA in the preprocessing step. Let us create a recipe for it.
rec_cancer <- recipe(diagnosis ~ ., cancer_train) %>%
  step_naomit(all_predictors()) %>%
  step_nzv(all_predictors()) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric()) %>%
  prep()
cancer_train <- juice(rec_cancer)
cancer_test <- bake(rec_cancer, cancer_test)
Here, we want to create a model using the same algorithm and specification to be compared with the previous model.
tic()
set.seed(100)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
model_cancer <- train(diagnosis ~ ., data = cancer_train, method = "rf", trControl = ctrl)
toc()
Training without PCA took 11.21 seconds, which means the model that uses PCA in the preprocessing step (4.88 seconds) trained about 2.3x faster.
pred_cancer <- predict(model_cancer, cancer_test)
How about the accuracy of the model? Is the accuracy greater when we do not use PCA? Let us check it using the confusion matrix below.
confusionMatrix(pred_cancer, cancer_test$diagnosis, positive = "M")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction B M
#> B 71 5
#> M 0 37
#>
#> Accuracy : 0.9558
#> 95% CI : (0.8998, 0.9855)
#> No Information Rate : 0.6283
#> P-Value [Acc > NIR] : < 2e-16
#>
#> Kappa : 0.9029
#>
#> Mcnemar's Test P-Value : 0.07364
#>
#> Sensitivity : 0.8810
#> Specificity : 1.0000
#> Pos Pred Value : 1.0000
#> Neg Pred Value : 0.9342
#> Prevalence : 0.3717
#> Detection Rate : 0.3274
#> Detection Prevalence : 0.3274
#> Balanced Accuracy : 0.9405
#>
#> 'Positive' Class : M
#>
It turns out, based on the confusion matrix above, that the accuracy without PCA (0.956) is slightly lower than with PCA (0.965), a difference of one test observation out of 113. Hence, PCA works well on data that is high-dimensional and has highly correlated variables [2].
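To put the two accuracies side by side, the confidence intervals can be recomputed from the counts of correct predictions: 109 of 113 with PCA and 108 of 113 without. The sketch below uses base R's binom.test(), on the assumption that the interval reported in the confusion matrix output is an exact binomial interval, and checks whether the two intervals overlap.

```r
# correct predictions out of 113 test observations, from the confusion matrices
acc_pca    <- 109 / 113   # with PCA
acc_no_pca <- 108 / 113   # without PCA

# exact binomial 95% confidence intervals for each accuracy
ci_pca    <- binom.test(109, 113)$conf.int
ci_no_pca <- binom.test(108, 113)$conf.int

# TRUE when the two intervals share at least one value
overlap <- ci_pca[1] <= ci_no_pca[2] && ci_no_pca[1] <= ci_pca[2]

print(round(c(acc_pca = acc_pca, acc_no_pca = acc_no_pca), 4))
print(round(rbind(with_pca = ci_pca, without_pca = ci_no_pca), 4))
print(overlap)
```

The intervals overlap, so on this test set the two models are best read as performing about equally well, with PCA giving the faster training.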
Result:
- The breast cancer dataset has many variables that are correlated with one another.
- The two models above (with and without PCA) have similar accuracy (0.965 with PCA, 0.956 without PCA).
- The training time with PCA is 4.88 seconds and without PCA is 11.21 seconds, so PCA saves 6.33 seconds; in other words, training with PCA is more than 2x faster than without it.
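The timings reported for both datasets can be compared in one small table. The sketch below only restates the numbers given in the two result lists and derives the seconds saved and the relative speedup.

```r
# training times in seconds, as reported in the two result lists
timings <- data.frame(
  dataset     = c("online shoppers", "breast cancer"),
  with_pca    = c(1608.41, 4.88),
  without_pca = c(1936.95, 11.21)
)

timings$saved_sec <- timings$without_pca - timings$with_pca   # absolute saving
timings$speedup   <- timings$without_pca / timings$with_pca   # relative speedup

print(timings)
```

The absolute saving is larger on the online shoppers data because training takes longer overall, while the relative speedup is larger on the breast cancer data, where PCA replaces 30 correlated predictors with 7 components.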
Conclusion
Principal Component Analysis (PCA) is very useful for speeding up computation by reducing the dimensionality of the data. In addition, when you have high-dimensional data whose variables are highly correlated with one another, PCA can maintain, and in our breast cancer example slightly improve, the accuracy of a classification model. Unfortunately, using PCA makes your machine learning model less interpretable. Also, PCA is only applicable when your dataset contains more than one numerical variable whose dimension you want to reduce.