Data Wars: Episode IV

Data wrangling sometimes could become very tedious. No matter what language do you prefer: R, python, or even SQL, the process of preprocessing your dataset is generally very time consuming. But, this is not the case if you know how to properly use packages included in tidyverse.

The term of tidyverse is actually referring to a set of packages that you’ll find very helpful in any data analysis tasks; many of them are already popular among R users, like dplyr, ggplot2, and lubridate. These packages are intended to work side-by-side; they share the same API style, and sharing a common data representation format. But another things that often missed is that the tidyverse itself is also a package! The tidyverse package is intended to make your environment set-up process faster; you only need to install tidyverse library to install the bundled packages, and just load the tidyverse library to load all the core packages.

In this articles (or maybe I should say, in this episode), I will cover the basic usage of core packages in tidyverse, using (of course) starwars dataset which included along with dplyr library.

There will be several sequels, and even prequels, for data wrangling using R and tidyverse tutorial–or to put it simple, data wars :grin:–so stay tuned!

Enter the tidyverse

As I mentioned before, you can install all of the included packages just by installing tidyverse. So let’s start by installing the package first:

install.packages("tidyverse")

If you are in an interactive session, you could see which packages are installed along with tidyverse. Just to make sure, let’s load the library:

library(tidyverse)
#> Warning: package 'tidyverse' was built under R version 3.5.3
#> -- Attaching packages ---------------------------- tidyverse 1.2.1 --
#> v ggplot2 3.2.1     v purrr   0.3.3
#> v tibble  2.1.3     v dplyr   0.8.3
#> v tidyr   1.0.0     v stringr 1.4.0
#> v readr   1.3.1     v forcats 0.4.0
#> Warning: package 'ggplot2' was built under R version 3.5.3
#> Warning: package 'tibble' was built under R version 3.5.3
#> Warning: package 'tidyr' was built under R version 3.5.3
#> Warning: package 'readr' was built under R version 3.5.3
#> Warning: package 'purrr' was built under R version 3.5.3
#> Warning: package 'dplyr' was built under R version 3.5.3
#> Warning: package 'stringr' was built under R version 3.5.3
#> Warning: package 'forcats' was built under R version 3.5.3
#> -- Conflicts ------------------------------- tidyverse_conflicts() --
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag()    masks stats::lag()

When importing the library, it give you messages about which packages that are attached to your session. But it only give you information about the core packages that fully loaded to your session, while actually tidyverse also attaching some great libraries in the background. You can see them all through sessionInfo():

sessionInfo()
#> R version 3.5.2 (2018-12-20)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 17763)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_Indonesia.1252  LC_CTYPE=English_Indonesia.1252   
#> [3] LC_MONETARY=English_Indonesia.1252 LC_NUMERIC=C                      
#> [5] LC_TIME=English_Indonesia.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] forcats_0.4.0   stringr_1.4.0   dplyr_0.8.3     purrr_0.3.3    
#> [5] readr_1.3.1     tidyr_1.0.0     tibble_2.1.3    ggplot2_3.2.1  
#> [9] tidyverse_1.2.1
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.2       cellranger_1.1.0 pillar_1.4.2     compiler_3.5.2  
#>  [5] tools_3.5.2      zeallot_0.1.0    digest_0.6.22    lubridate_1.7.4 
#>  [9] jsonlite_1.6     evaluate_0.14    lifecycle_0.1.0  nlme_3.1-141    
#> [13] gtable_0.3.0     lattice_0.20-38  pkgconfig_2.0.3  rlang_0.4.1     
#> [17] cli_1.1.0        rstudioapi_0.10  yaml_2.2.0       haven_2.1.1     
#> [21] blogdown_0.16    xfun_0.10        withr_2.1.2      xml2_1.2.2      
#> [25] httr_1.4.1       knitr_1.25       hms_0.5.1        generics_0.0.2  
#> [29] vctrs_0.2.0      grid_3.5.2       tidyselect_0.2.5 glue_1.3.1      
#> [33] R6_2.4.0         readxl_1.3.1     rmarkdown_1.16   bookdown_0.14   
#> [37] modelr_0.1.5     magrittr_1.5     backports_1.1.5  scales_1.0.0    
#> [41] htmltools_0.4.0  rvest_0.3.4      assertthat_0.2.1 colorspace_1.4-1
#> [45] stringi_1.4.3    lazyeval_0.2.2   munsell_0.5.0    broom_0.5.2     
#> [49] crayon_1.3.4

As you can see, it actually loads many packages in the background. Don’t worry, I will not cover all of them in this article; But I will cover them for sure in later sequels and prequels :grin:. If you really curious about those packages, I suggest you to checkout the detailed list at the tidyverse official documentations.

Use the standard data representation: tibble

The first step in learning the tidy data wrangling is to be familiar with its standard data representation. Before I explain it in detail, let me give you a quick example by printing the iris dataset:

head(iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa

and compare the print output with starwars dataset from dplyr:

starwars
#> # A tibble: 87 x 13
#>    name  height  mass hair_color skin_color eye_color birth_year gender
#>    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
#>  1 Luke~    172    77 blond      fair       blue            19   male  
#>  2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>  
#>  3 R2-D2     96    32 <NA>       white, bl~ red             33   <NA>  
#>  4 Dart~    202   136 none       white      yellow          41.9 male  
#>  5 Leia~    150    49 brown      light      brown           19   female
#>  6 Owen~    178   120 brown, gr~ light      blue            52   male  
#>  7 Beru~    165    75 brown      light      blue            47   female
#>  8 R5-D4     97    32 <NA>       white, red red             NA   <NA>  
#>  9 Bigg~    183    84 black      light      brown           24   male  
#> 10 Obi-~    182    77 auburn, w~ fair       blue-gray       57   male  
#> # ... with 77 more rows, and 5 more variables: homeworld <chr>,
#> #   species <chr>, films <list>, vehicles <list>, starships <list>

There are sooo many differences between the two outputs that I could elaborate more, but let’s focus on several important points:

  • When printing iris dataset, I had to call it through head() to avoid printing the whole dataset. This is not the case for starwars dataset
  • Base print output is just as-is, not giving some useful information
  • If you see the details in starwars’ print output, it is showing that it can store a more complex object, like list in this case

Why they behave differently? The answer lies in their class:

class(iris)
#> [1] "data.frame"

class(starwars)
#> [1] "tbl_df"     "tbl"        "data.frame"

The starwars dataset class is showing that it’s actually a tibble; which often referred as tbl or tbl_* when related to a specific data class. A tibble is a little bit different than your usual data.frame. I already outlined the main differences above, but in additition to that, it should be noted that tibble is the main data representation across the tidyverse; and also some extension that adopting the tidy principle, e.g., tbl_ts from tsibble for tidy time series data representation.

There are so many other advantages if you use tibble. For example, if you want to subset just one column, a simple data.frame will failed you again:

iris[, "Species"]
#>   [1] setosa     setosa     setosa     setosa     setosa     setosa    
#>   [7] setosa     setosa     setosa     setosa     setosa     setosa    
#>  [13] setosa     setosa     setosa     setosa     setosa     setosa    
#>  [19] setosa     setosa     setosa     setosa     setosa     setosa    
#>  [25] setosa     setosa     setosa     setosa     setosa     setosa    
#>  [31] setosa     setosa     setosa     setosa     setosa     setosa    
#>  [37] setosa     setosa     setosa     setosa     setosa     setosa    
#>  [43] setosa     setosa     setosa     setosa     setosa     setosa    
#>  [49] setosa     setosa     versicolor versicolor versicolor versicolor
#>  [55] versicolor versicolor versicolor versicolor versicolor versicolor
#>  [61] versicolor versicolor versicolor versicolor versicolor versicolor
#>  [67] versicolor versicolor versicolor versicolor versicolor versicolor
#>  [73] versicolor versicolor versicolor versicolor versicolor versicolor
#>  [79] versicolor versicolor versicolor versicolor versicolor versicolor
#>  [85] versicolor versicolor versicolor versicolor versicolor versicolor
#>  [91] versicolor versicolor versicolor versicolor versicolor versicolor
#>  [97] versicolor versicolor versicolor versicolor virginica  virginica 
#> [103] virginica  virginica  virginica  virginica  virginica  virginica 
#> [109] virginica  virginica  virginica  virginica  virginica  virginica 
#> [115] virginica  virginica  virginica  virginica  virginica  virginica 
#> [121] virginica  virginica  virginica  virginica  virginica  virginica 
#> [127] virginica  virginica  virginica  virginica  virginica  virginica 
#> [133] virginica  virginica  virginica  virginica  virginica  virginica 
#> [139] virginica  virginica  virginica  virginica  virginica  virginica 
#> [145] virginica  virginica  virginica  virginica  virginica  virginica 
#> Levels: setosa versicolor virginica

It will drop its “table” representation and convert the data into a vector, instead of a data.frame with one column. But this is not the case with tibble:

iris_tbl <- as_tibble(iris)

iris_tbl[, "Species"]
#> # A tibble: 150 x 1
#>    Species
#>    <fct>  
#>  1 setosa 
#>  2 setosa 
#>  3 setosa 
#>  4 setosa 
#>  5 setosa 
#>  6 setosa 
#>  7 setosa 
#>  8 setosa 
#>  9 setosa 
#> 10 setosa 
#> # ... with 140 more rows

Why this is justified as a more tidy way? because if you recall the R’s standard way to access a specific part from an object is actually using $:

iris_tbl$Species
#>   [1] setosa     setosa     setosa     setosa     setosa     setosa    
#>   [7] setosa     setosa     setosa     setosa     setosa     setosa    
#>  [13] setosa     setosa     setosa     setosa     setosa     setosa    
#>  [19] setosa     setosa     setosa     setosa     setosa     setosa    
#>  [25] setosa     setosa     setosa     setosa     setosa     setosa    
#>  [31] setosa     setosa     setosa     setosa     setosa     setosa    
#>  [37] setosa     setosa     setosa     setosa     setosa     setosa    
#>  [43] setosa     setosa     setosa     setosa     setosa     setosa    
#>  [49] setosa     setosa     versicolor versicolor versicolor versicolor
#>  [55] versicolor versicolor versicolor versicolor versicolor versicolor
#>  [61] versicolor versicolor versicolor versicolor versicolor versicolor
#>  [67] versicolor versicolor versicolor versicolor versicolor versicolor
#>  [73] versicolor versicolor versicolor versicolor versicolor versicolor
#>  [79] versicolor versicolor versicolor versicolor versicolor versicolor
#>  [85] versicolor versicolor versicolor versicolor versicolor versicolor
#>  [91] versicolor versicolor versicolor versicolor versicolor versicolor
#>  [97] versicolor versicolor versicolor versicolor virginica  virginica 
#> [103] virginica  virginica  virginica  virginica  virginica  virginica 
#> [109] virginica  virginica  virginica  virginica  virginica  virginica 
#> [115] virginica  virginica  virginica  virginica  virginica  virginica 
#> [121] virginica  virginica  virginica  virginica  virginica  virginica 
#> [127] virginica  virginica  virginica  virginica  virginica  virginica 
#> [133] virginica  virginica  virginica  virginica  virginica  virginica 
#> [139] virginica  virginica  virginica  virginica  virginica  virginica 
#> [145] virginica  virginica  virginica  virginica  virginica  virginica 
#> Levels: setosa versicolor virginica

which gives you the exact results.

So my concluding notes are:

  • Use tibble format whenever it’s possible. I rarely found any situation that using tbl or tbl_* object is not acceptable, because it is still inheriting the standard data.frame class.
  • When working a specific case, e.g., time series dataset, or big data, always look if there are any tidy representation for the data; see tsibble and sparklyr for example.

Easier data manipulation with dplyr

The dplyr package is probably the most important part of tidyverse. It brings so many convenient utilities for data manipulation, particularly for some basic tasks; yet sometimes, could be troublesome if you only use base R functionalities.

Among many advanced features inside dplyr, I suggest you to start with the cores data manipulation functions:

  • select(): for selecting a subset of data by column
  • filter(): for filtering a subset of data by row using condition
  • arrange(): for re-arranging the rows order
  • mutate(): for mutating (:confused:) a new or existing variables

As implied by my description, most of dplyr functions could be described by its names; except mutate, which is very likely to be confusing at the first time.

Those function can be naturally combined with the core aggregation functions:

  • group_by(): for specifying that following operations should be done by group(s)
  • summarise(): for obtaining the summarise of the data by the specified group(s).

These functions probably would cover most of your data wrangling tasks; and if done in a proper tidy way, then you’ll find that data wrangling process will be easier. Generally, most of these workflow could be combined through a pipeline using %>% (pipe) operator to make it tidier.

Let’s try out these features using starwars dataset. The first step–and I think the most important step–that we should do is setting our output goal. For this tutorial, let’s try to find out how the height and mass differ across homeworld and gender for human species. So we will transform the dataset from:

starwars
#> # A tibble: 87 x 13
#>    name  height  mass hair_color skin_color eye_color birth_year gender
#>    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
#>  1 Luke~    172    77 blond      fair       blue            19   male  
#>  2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>  
#>  3 R2-D2     96    32 <NA>       white, bl~ red             33   <NA>  
#>  4 Dart~    202   136 none       white      yellow          41.9 male  
#>  5 Leia~    150    49 brown      light      brown           19   female
#>  6 Owen~    178   120 brown, gr~ light      blue            52   male  
#>  7 Beru~    165    75 brown      light      blue            47   female
#>  8 R5-D4     97    32 <NA>       white, red red             NA   <NA>  
#>  9 Bigg~    183    84 black      light      brown           24   male  
#> 10 Obi-~    182    77 auburn, w~ fair       blue-gray       57   male  
#> # ... with 77 more rows, and 5 more variables: homeworld <chr>,
#> #   species <chr>, films <list>, vehicles <list>, starships <list>

into this format:

# the output goal:

#> # A tibble: 21 x 5
#>    homeworld    gender  mass height   bmi
#>    <chr>        <chr>  <dbl>  <dbl> <dbl>
#>  1 Bestine IV   male   110     180   61.1
#>  2 Tatooine     male   100.    184.  54.0
#>  3 <NA>         male    89     193   46.1
#>  4 Tatooine     female  75     164   45.5
#>  5 Bespin       male    79     175   45.1
#>  6 Naboo        male    80     178.  45.0
#>  7 Corellia     male    78.5   175   44.9
#>  8 Haruun Kal   male    84     188   44.7
#>  9 Socorro      male    79     177   44.6
#> 10 Concord Dawn male    79     183   43.2
#> # ... with 11 more rows

Column subsetting using select()

select() is probably the first function you will call in every data manipulation tasks. By using select(), you can narrow down the number variables to fewer variables–which are what you truly need.

The select() usage (and also every core functions in dplyr) is very straighforward. You just need to pass the dataset and select the column name you want to select:

select(starwars, name)
#> # A tibble: 87 x 1
#>    name              
#>    <chr>             
#>  1 Luke Skywalker    
#>  2 C-3PO             
#>  3 R2-D2             
#>  4 Darth Vader       
#>  5 Leia Organa       
#>  6 Owen Lars         
#>  7 Beru Whitesun lars
#>  8 R5-D4             
#>  9 Biggs Darklighter 
#> 10 Obi-Wan Kenobi    
#> # ... with 77 more rows

It’s very easy right? Before we continue to our objective, there are some tips you need to consider.

Chaining operation using %>%

When we doing a complex data manipulation using dplyr, I suggest to use %>% operator to make our code tidier:

starwars %>%
  select(name)
#> # A tibble: 87 x 1
#>    name              
#>    <chr>             
#>  1 Luke Skywalker    
#>  2 C-3PO             
#>  3 R2-D2             
#>  4 Darth Vader       
#>  5 Leia Organa       
#>  6 Owen Lars         
#>  7 Beru Whitesun lars
#>  8 R5-D4             
#>  9 Biggs Darklighter 
#> 10 Obi-Wan Kenobi    
#> # ... with 77 more rows

This is because %>% operator could help you to make a chaining operation. The pipe operator will pass the results from its left-hand side, to the first parameter in the function of its right-hand side. So you could make a chaining operation like this:

starwars %>%
  select(name) %>%
  anotherfuns(<arguments>) %>%
  anotherfuns(<arguments>) %>%
  <continue the chain as long as necessary> %>%
  anotherfuns(<arguments>)

So as long as anotherfuns you chained could accept the previous result as its first argument, you can chaining the operation as long as you want.

Checking the variables using glimpse()

Now back to our objective. Remember that we want to make a summary of height and mass of human species across homeworld and gender. So we actually need only those few variables to go on. Still, remembering the name of those variables could be frustating sometimes, so I suggest to use glimpse() function before you jump into variable selection:

glimpse(starwars)
#> Observations: 87
#> Variables: 13
#> $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", ...
#> $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188...
#> $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 8...
#> $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "b...
#> $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "l...
#> $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue",...
#> $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0...
#> $ gender     <chr> "male", NA, NA, "male", "female", "male", "female",...
#> $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alder...
#> $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human...
#> $ films      <list> [<"Revenge of the Sith", "Return of the Jedi", "Th...
#> $ vehicles   <list> [<"Snowspeeder", "Imperial Speeder Bike">, <>, <>,...
#> $ starships  <list> [<"X-wing", "Imperial shuttle">, <>, <>, "TIE Adva...

The glimpse() function is practically the simplified version of str(); but it only print the most general information like the variable names, class, etc. Another alternative is to use the base colnames(), but it doesn’t provide additional information like glimpse(), so I suggest to use the later.

From glimpse() output, now we have the list of variable names and can be sure that we have to select species, homeworld, gender, mass, height:

starwars %>%
  select(name, species, homeworld, gender, mass, height)
#> # A tibble: 87 x 6
#>    name               species homeworld gender  mass height
#>    <chr>              <chr>   <chr>     <chr>  <dbl>  <int>
#>  1 Luke Skywalker     Human   Tatooine  male      77    172
#>  2 C-3PO              Droid   Tatooine  <NA>      75    167
#>  3 R2-D2              Droid   Naboo     <NA>      32     96
#>  4 Darth Vader        Human   Tatooine  male     136    202
#>  5 Leia Organa        Human   Alderaan  female    49    150
#>  6 Owen Lars          Human   Tatooine  male     120    178
#>  7 Beru Whitesun lars Human   Tatooine  female    75    165
#>  8 R5-D4              Droid   Tatooine  <NA>      32     97
#>  9 Biggs Darklighter  Human   Tatooine  male      84    183
#> 10 Obi-Wan Kenobi     Human   Stewjon   male      77    182
#> # ... with 77 more rows

Now the dataset looks a lot simpler 😀

Filtering specific observation using filter()

You already see that a simple column selection could help you in starting the data wrangling process. But row/observation selection is sometimes a compulsory step rather than just a helper step. For example, you want to exclude or include specific observations which inheriting some specific traits. If you recall our objective, and to be more specific at “for human species”, then we are absolutely need to filter our dataset to only containing human species.

As aforementioned, filter() function is also very easy to use: just pass the dataset object as the first argument, then you can pass the condition(s) as the following argument(s).

To declare the condition, the most common way is to use the base condition testing:

  • == and != for testing equal, and not equal condition
  • > or >= for testing greater than, and greater than or equal
  • < or <= for testing less than, and less than or equal

Let’s try to filter observation with mass less than or equal to 50 from starwars dataset:

filter(starwars, mass <= 50)
#> # A tibble: 13 x 13
#>    name  height  mass hair_color skin_color eye_color birth_year gender
#>    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
#>  1 R2-D2     96    32 <NA>       white, bl~ red               33 <NA>  
#>  2 Leia~    150    49 brown      light      brown             19 female
#>  3 R5-D4     97    32 <NA>       white, red red               NA <NA>  
#>  4 Yoda      66    17 white      green      brown            896 male  
#>  5 Wick~     88    20 brown      brown      brown              8 male  
#>  6 Sebu~    112    40 none       grey, red  orange            NA male  
#>  7 Dud ~     94    45 none       blue, grey yellow            NA male  
#>  8 Adi ~    184    50 none       dark       blue              NA female
#>  9 Barr~    166    50 black      yellow     blue              40 female
#> 10 Ratt~     79    15 none       grey, blue unknown           NA male  
#> 11 Wat ~    193    48 none       green, gr~ unknown           NA male  
#> 12 Sly ~    178    48 none       pale       white             NA female
#> 13 Padm~    165    45 brown      light      brown             46 female
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

It might be looks very simple, but let’s dive deeper on how condition testing works to see its full potential.

Behind the scene of condition testing

Before I explain the details, let’s take a look at the first 5 mass value from starwars:

head(starwars$mass)
#> [1]  77  75  32 136  49 120

If we test the condition of <= 50 to above data, we will actually get this output:

head(starwars$mass) <= 50
#> [1] FALSE FALSE  TRUE FALSE  TRUE FALSE

So basically filtering condition is picking up rows with the TRUE value; and it will discard the rest. Based on this mechanism, we could also use other helper functions to test a condition, e.g., when we want to see rows with NA values, we can use is.na() function:

filter(starwars, is.na(mass))
#> # A tibble: 28 x 13
#>    name  height  mass hair_color skin_color eye_color birth_year gender
#>    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
#>  1 Wilh~    180    NA auburn, g~ fair       blue              64 male  
#>  2 Mon ~    150    NA auburn     fair       blue              48 female
#>  3 Arve~     NA    NA brown      fair       brown             NA male  
#>  4 Fini~    170    NA blond      fair       blue              91 male  
#>  5 Rugo~    206    NA none       green      orange            NA male  
#>  6 Ric ~    183    NA brown      fair       blue              NA male  
#>  7 Watto    137    NA black      blue, grey yellow            NA male  
#>  8 Quar~    183    NA black      dark       brown             62 male  
#>  9 Shmi~    163    NA black      fair       brown             72 female
#> 10 Bib ~    180    NA none       pale       pink              NA male  
#> # ... with 18 more rows, and 5 more variables: homeworld <chr>,
#> #   species <chr>, films <list>, vehicles <list>, starships <list>

or if we want to keep observations with non-NA value, then we could create a negation using !:

filter(starwars, !is.na(mass))
#> # A tibble: 59 x 13
#>    name  height  mass hair_color skin_color eye_color birth_year gender
#>    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
#>  1 Luke~    172    77 blond      fair       blue            19   male  
#>  2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>  
#>  3 R2-D2     96    32 <NA>       white, bl~ red             33   <NA>  
#>  4 Dart~    202   136 none       white      yellow          41.9 male  
#>  5 Leia~    150    49 brown      light      brown           19   female
#>  6 Owen~    178   120 brown, gr~ light      blue            52   male  
#>  7 Beru~    165    75 brown      light      blue            47   female
#>  8 R5-D4     97    32 <NA>       white, red red             NA   <NA>  
#>  9 Bigg~    183    84 black      light      brown           24   male  
#> 10 Obi-~    182    77 auburn, w~ fair       blue-gray       57   male  
#> # ... with 49 more rows, and 5 more variables: homeworld <chr>,
#> #   species <chr>, films <list>, vehicles <list>, starships <list>

Quick-check on conditions availability

Just like when we check the variables using glimpse(), we also need to check our variables to make sure about any available options to be filtered. For our case, we already know that we want to get only human observations, but straighly jump into filter() is very unwise:

starwars %>%
  select(name, species, homeworld, gender, mass, height) %>%
  filter(species == "human")
#> # A tibble: 0 x 6
#> # ... with 6 variables: name <chr>, species <chr>, homeworld <chr>,
#> #   gender <chr>, mass <dbl>, height <int>

It seems like there is no "human" species, but actually it’s just because we specified a wrong condition. For categorical variable like species, I really suggest you to check it using unique() first:

unique(starwars$species)
#>  [1] "Human"          "Droid"          "Wookiee"        "Rodian"        
#>  [5] "Hutt"           "Yoda's species" "Trandoshan"     "Mon Calamari"  
#>  [9] "Ewok"           "Sullustan"      "Neimodian"      "Gungan"        
#> [13] NA               "Toydarian"      "Dug"            "Zabrak"        
#> [17] "Twi'lek"        "Vulptereen"     "Xexto"          "Toong"         
#> [21] "Cerean"         "Nautolan"       "Tholothian"     "Iktotchi"      
#> [25] "Quermian"       "Kel Dor"        "Chagrian"       "Geonosian"     
#> [29] "Mirialan"       "Clawdite"       "Besalisk"       "Kaminoan"      
#> [33] "Aleena"         "Skakoan"        "Muun"           "Togruta"       
#> [37] "Kaleesh"        "Pau'an"

Which shows that it’s actually "Human". For numerical variables, you can use summary() to help you quick-check the value range:

summary(starwars$height)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#>    66.0   167.0   180.0   174.4   191.0   264.0       6

After we quick-check the values within our variable of interest, it’s safe now to call filter() function:

starwars %>%
  select(name, species, homeworld, gender, mass, height) %>%
  filter(species == "Human")
#> # A tibble: 35 x 6
#>    name               species homeworld gender  mass height
#>    <chr>              <chr>   <chr>     <chr>  <dbl>  <int>
#>  1 Luke Skywalker     Human   Tatooine  male      77    172
#>  2 Darth Vader        Human   Tatooine  male     136    202
#>  3 Leia Organa        Human   Alderaan  female    49    150
#>  4 Owen Lars          Human   Tatooine  male     120    178
#>  5 Beru Whitesun lars Human   Tatooine  female    75    165
#>  6 Biggs Darklighter  Human   Tatooine  male      84    183
#>  7 Obi-Wan Kenobi     Human   Stewjon   male      77    182
#>  8 Anakin Skywalker   Human   Tatooine  male      84    188
#>  9 Wilhuff Tarkin     Human   Eriadu    male      NA    180
#> 10 Han Solo           Human   Corellia  male      80    180
#> # ... with 25 more rows

Manipulating variables using mutate()

When I mentioned mutate() terms, most of you that are not familiar with dplyr probably not sure on its meaning. This function is basically helping us to manipulating the variables inside our dataframe. For example, you could change the units of mass and height index:

mutate(starwars, mass = mass * 100, height = height / 100)
#> # A tibble: 87 x 13
#>    name  height  mass hair_color skin_color eye_color birth_year gender
#>    <chr>  <dbl> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
#>  1 Luke~   1.72  7700 blond      fair       blue            19   male  
#>  2 C-3PO   1.67  7500 <NA>       gold       yellow         112   <NA>  
#>  3 R2-D2   0.96  3200 <NA>       white, bl~ red             33   <NA>  
#>  4 Dart~   2.02 13600 none       white      yellow          41.9 male  
#>  5 Leia~   1.5   4900 brown      light      brown           19   female
#>  6 Owen~   1.78 12000 brown, gr~ light      blue            52   male  
#>  7 Beru~   1.65  7500 brown      light      blue            47   female
#>  8 R5-D4   0.97  3200 <NA>       white, red red             NA   <NA>  
#>  9 Bigg~   1.83  8400 black      light      brown           24   male  
#> 10 Obi-~   1.82  7700 auburn, w~ fair       blue-gray       57   male  
#> # ... with 77 more rows, and 5 more variables: homeworld <chr>,
#> #   species <chr>, films <list>, vehicles <list>, starships <list>

But it doesn’t limited to manipulating the existing variables; you could also make a new variable using this function. Say, we want to add Body Mass Index (BMI) to our current analysis, we could create it using mutate():

starwars %>%
  select(name, species, homeworld, gender, mass, height) %>%
  filter(species == "Human") %>%
  mutate(bmi = mass / height * 100)
#> # A tibble: 35 x 7
#>    name               species homeworld gender  mass height   bmi
#>    <chr>              <chr>   <chr>     <chr>  <dbl>  <int> <dbl>
#>  1 Luke Skywalker     Human   Tatooine  male      77    172  44.8
#>  2 Darth Vader        Human   Tatooine  male     136    202  67.3
#>  3 Leia Organa        Human   Alderaan  female    49    150  32.7
#>  4 Owen Lars          Human   Tatooine  male     120    178  67.4
#>  5 Beru Whitesun lars Human   Tatooine  female    75    165  45.5
#>  6 Biggs Darklighter  Human   Tatooine  male      84    183  45.9
#>  7 Obi-Wan Kenobi     Human   Stewjon   male      77    182  42.3
#>  8 Anakin Skywalker   Human   Tatooine  male      84    188  44.7
#>  9 Wilhuff Tarkin     Human   Eriadu    male      NA    180  NA  
#> 10 Han Solo           Human   Corellia  male      80    180  44.4
#> # ... with 25 more rows

This function is very versatile, so play around with your creativity here :sunglasses:

Data aggregation using group_by and summarise

We are reaching the final parts! Let’s recall the part of our objective that we haven’t done yet: “across the homeworld and gender”. This particular phrase is indicating that we want each of our observation to representing its homeworld and gender instead of each character’s name like what we currently have. We can achieve this using group_by() and summarise() function.

To declare a group, you just need to specify the name of variables that you want to set as group identifier:

starwars %>%
  select(name, species, homeworld, gender, mass, height) %>%
  filter(species == "Human") %>%
  mutate(bmi = mass / height * 100) %>%
  group_by(homeworld, gender)
#> # A tibble: 35 x 7
#> # Groups:   homeworld, gender [21]
#>    name               species homeworld gender  mass height   bmi
#>    <chr>              <chr>   <chr>     <chr>  <dbl>  <int> <dbl>
#>  1 Luke Skywalker     Human   Tatooine  male      77    172  44.8
#>  2 Darth Vader        Human   Tatooine  male     136    202  67.3
#>  3 Leia Organa        Human   Alderaan  female    49    150  32.7
#>  4 Owen Lars          Human   Tatooine  male     120    178  67.4
#>  5 Beru Whitesun lars Human   Tatooine  female    75    165  45.5
#>  6 Biggs Darklighter  Human   Tatooine  male      84    183  45.9
#>  7 Obi-Wan Kenobi     Human   Stewjon   male      77    182  42.3
#>  8 Anakin Skywalker   Human   Tatooine  male      84    188  44.7
#>  9 Wilhuff Tarkin     Human   Eriadu    male      NA    180  NA  
#> 10 Han Solo           Human   Corellia  male      80    180  44.4
#> # ... with 25 more rows

Then the tibble will automatically give you information about which group is active.

Note that the order of specified groups is also matters. The function will process the group operation from the right-most variable. So the best practice here to set the order from major to minor group:

... %>%
  group_by(<major group>, <minor group>, <definitely the minorest group>) %>%
  ...

After we set the group, we can use summarise() to do the aggregation. This function is almost identic to how mutate works, but the output will be aggregated value by the group we specified; note that if we don’t specify any group, then it will aggregate the value to one observation that representing our dataset.

Let’s try to take the mean of height, mass, and bmi variables:

starwars %>%
  select(name, species, homeworld, gender, mass, height) %>%
  filter(species == "Human") %>%
  mutate(bmi = mass / height * 100) %>%
  group_by(homeworld, gender) %>%
  summarise(
    mass = mean(mass, na.rm = TRUE), # use na.rm since there are
    height = mean(height, na.rm = TRUE), # some missing values
    bmi = mean(bmi, na.rm = TRUE)
  )
#> # A tibble: 21 x 5
#> # Groups:   homeworld [16]
#>    homeworld    gender  mass height   bmi
#>    <chr>        <chr>  <dbl>  <dbl> <dbl>
#>  1 Alderaan     female  49     150   32.7
#>  2 Alderaan     male    79     190.  42.0
#>  3 Bespin       male    79     175   45.1
#>  4 Bestine IV   male   110     180   61.1
#>  5 Chandrila    female NaN     150  NaN  
#>  6 Concord Dawn male    79     183   43.2
#>  7 Corellia     male    78.5   175   44.9
#>  8 Coruscant    female NaN     167  NaN  
#>  9 Coruscant    male   NaN     170  NaN  
#> 10 Eriadu       male   NaN     180  NaN  
#> # ... with 11 more rows

When we done with group operations, don’t forget to set the group declaration to off. The most straighforward ways to do this is by passing the result to ungroup():

starwars %>%
  select(name, species, homeworld, gender, mass, height) %>%
  filter(species == "Human") %>%
  mutate(bmi = mass / height * 100) %>%
  group_by(homeworld, gender) %>%
  summarise(
    mass = mean(mass, na.rm = TRUE),
    height = mean(height, na.rm = TRUE),
    bmi = mean(bmi, na.rm = TRUE)
  ) %>%
  ungroup()
#> # A tibble: 21 x 5
#>    homeworld    gender  mass height   bmi
#>    <chr>        <chr>  <dbl>  <dbl> <dbl>
#>  1 Alderaan     female  49     150   32.7
#>  2 Alderaan     male    79     190.  42.0
#>  3 Bespin       male    79     175   45.1
#>  4 Bestine IV   male   110     180   61.1
#>  5 Chandrila    female NaN     150  NaN  
#>  6 Concord Dawn male    79     183   43.2
#>  7 Corellia     male    78.5   175   44.9
#>  8 Coruscant    female NaN     167  NaN  
#>  9 Coruscant    male   NaN     170  NaN  
#> 10 Eriadu       male   NaN     180  NaN  
#> # ... with 11 more rows

Re-ordering observation using arrange()

Our last results actually enough for showing the summary of mass and height across homeworld and gender. But it doesn’t give us some basic informations, like, which homeworld-gender combination has the lowest or higher bmi.

Fortunately, dplyr provided arrange() function for this specific task. By default, this function will arrange the specified variable in ascending mode:

starwars %>%
  select(name, species, homeworld, gender, mass, height) %>%
  filter(species == "Human") %>%
  mutate(bmi = mass / height * 100) %>%
  group_by(homeworld, gender) %>%
  summarise(
    mass = mean(mass, na.rm = TRUE),
    height = mean(height, na.rm = TRUE),
    bmi = mean(bmi, na.rm = TRUE)
  ) %>%
  ungroup() %>%
  arrange(bmi)
#> # A tibble: 21 x 5
#>    homeworld    gender  mass height   bmi
#>    <chr>        <chr>  <dbl>  <dbl> <dbl>
#>  1 Naboo        female  45     162.  27.3
#>  2 Alderaan     female  49     150   32.7
#>  3 Serenno      male    80     193   41.5
#>  4 Alderaan     male    79     190.  42.0
#>  5 Stewjon      male    77     182   42.3
#>  6 Kamino       male    78.2   183   42.7
#>  7 Concord Dawn male    79     183   43.2
#>  8 Socorro      male    79     177   44.6
#>  9 Haruun Kal   male    84     188   44.7
#> 10 Corellia     male    78.5   175   44.9
#> # ... with 11 more rows

And if you want descending sorting, you just need to wrap the variable using desc():

starwars %>%
  select(name, species, homeworld, gender, mass, height) %>%
  filter(species == "Human") %>%
  mutate(bmi = mass / height * 100) %>%
  group_by(homeworld, gender) %>%
  summarise(
    mass = mean(mass, na.rm = TRUE),
    height = mean(height, na.rm = TRUE),
    bmi = mean(bmi, na.rm = TRUE)
  ) %>%
  ungroup() %>%
  arrange(desc(bmi))
#> # A tibble: 21 x 5
#>    homeworld    gender  mass height   bmi
#>    <chr>        <chr>  <dbl>  <dbl> <dbl>
#>  1 Bestine IV   male   110     180   61.1
#>  2 Tatooine     male   100.    184.  54.0
#>  3 <NA>         male    89     193   46.1
#>  4 Tatooine     female  75     164   45.5
#>  5 Bespin       male    79     175   45.1
#>  6 Naboo        male    80     178.  45.0
#>  7 Corellia     male    78.5   175   44.9
#>  8 Haruun Kal   male    84     188   44.7
#>  9 Socorro      male    79     177   44.6
#> 10 Concord Dawn male    79     183   43.2
#> # ... with 11 more rows

Now we know that the females from Naboo have the lowest mean BMI, while the males from Bestine have the highest mean BMI 😀

Concluding remarks

In this article, you already see how easy–and tidy!–it is to do data wrangling tasks using tibble and dplyr. But we just scratched the surface of tidyverse; there are so many package and functionalities available for us to explore!

For the next episode, we will try more realistic cases, and see how tidyverse packages could help us in a more complex data wrangling tasks. So stay tuned on this series! :sunglasses:

Here is a tidyverse logo for you:

#> * __  _    __   .    o           *  . 
#>  / /_(_)__/ /_ ___  _____ _______ ___ 
#> / __/ / _  / // / |/ / -_) __(_-</ -_)
#> \__/_/\_,_/\_, /|___/\__/_/ /___/\__/ 
#>      *  . /___/      o      .       *
Scroll to Top