Data wrangling sometimes could become very tedious. No matter what language do you prefer: R, python, or even SQL, the process of preprocessing your dataset is generally very time consuming. But, this is not the case if you know how to properly use packages included in tidyverse
.
The term of tidyverse
is actually referring to a set of packages that you’ll find very helpful in any data analysis tasks; many of them are already popular among R users, like dplyr
, ggplot2
, and lubridate
. These packages are intended to work side-by-side; they share the same API style, and sharing a common data representation format. But another things that often missed is that the tidyverse
itself is also a package! The tidyverse
package is intended to make your environment set-up process faster; you only need to install tidyverse
library to install the bundled packages, and just load the tidyverse
library to load all the core packages.
In this articles (or maybe I should say, in this episode), I will cover the basic usage of core packages in tidyverse
, using (of course) starwars
dataset which included along with dplyr
library.
There will be several sequels, and even prequels, for data wrangling using R and tidyverse
tutorial–or to put it simple, data wars :grin:–so stay tuned!
Enter the tidyverse
As I mentioned before, you can install all of the included packages just by installing tidyverse
. So let’s start by installing the package first:
install.packages("tidyverse")
If you are in an interactive session, you could see which packages are installed along with tidyverse
. Just to make sure, let’s load the library:
library(tidyverse)
#> Warning: package 'tidyverse' was built under R version 3.5.3
#> -- Attaching packages ---------------------------- tidyverse 1.2.1 --
#> v ggplot2 3.2.1 v purrr 0.3.3
#> v tibble 2.1.3 v dplyr 0.8.3
#> v tidyr 1.0.0 v stringr 1.4.0
#> v readr 1.3.1 v forcats 0.4.0
#> Warning: package 'ggplot2' was built under R version 3.5.3
#> Warning: package 'tibble' was built under R version 3.5.3
#> Warning: package 'tidyr' was built under R version 3.5.3
#> Warning: package 'readr' was built under R version 3.5.3
#> Warning: package 'purrr' was built under R version 3.5.3
#> Warning: package 'dplyr' was built under R version 3.5.3
#> Warning: package 'stringr' was built under R version 3.5.3
#> Warning: package 'forcats' was built under R version 3.5.3
#> -- Conflicts ------------------------------- tidyverse_conflicts() --
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
When importing the library, it give you messages about which packages that are attached to your session. But it only give you information about the core packages that fully loaded to your session, while actually tidyverse
also attaching some great libraries in the background. You can see them all through sessionInfo()
:
sessionInfo()
#> R version 3.5.2 (2018-12-20)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 17763)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=English_Indonesia.1252 LC_CTYPE=English_Indonesia.1252
#> [3] LC_MONETARY=English_Indonesia.1252 LC_NUMERIC=C
#> [5] LC_TIME=English_Indonesia.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] forcats_0.4.0 stringr_1.4.0 dplyr_0.8.3 purrr_0.3.3
#> [5] readr_1.3.1 tidyr_1.0.0 tibble_2.1.3 ggplot2_3.2.1
#> [9] tidyverse_1.2.1
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.2 cellranger_1.1.0 pillar_1.4.2 compiler_3.5.2
#> [5] tools_3.5.2 zeallot_0.1.0 digest_0.6.22 lubridate_1.7.4
#> [9] jsonlite_1.6 evaluate_0.14 lifecycle_0.1.0 nlme_3.1-141
#> [13] gtable_0.3.0 lattice_0.20-38 pkgconfig_2.0.3 rlang_0.4.1
#> [17] cli_1.1.0 rstudioapi_0.10 yaml_2.2.0 haven_2.1.1
#> [21] blogdown_0.16 xfun_0.10 withr_2.1.2 xml2_1.2.2
#> [25] httr_1.4.1 knitr_1.25 hms_0.5.1 generics_0.0.2
#> [29] vctrs_0.2.0 grid_3.5.2 tidyselect_0.2.5 glue_1.3.1
#> [33] R6_2.4.0 readxl_1.3.1 rmarkdown_1.16 bookdown_0.14
#> [37] modelr_0.1.5 magrittr_1.5 backports_1.1.5 scales_1.0.0
#> [41] htmltools_0.4.0 rvest_0.3.4 assertthat_0.2.1 colorspace_1.4-1
#> [45] stringi_1.4.3 lazyeval_0.2.2 munsell_0.5.0 broom_0.5.2
#> [49] crayon_1.3.4
As you can see, it actually loads many packages in the background. Don’t worry, I will not cover all of them in this article; But I will cover them for sure in later sequels and prequels :grin:. If you really curious about those packages, I suggest you to checkout the detailed list at the tidyverse
official documentations.
Use the standard data representation: tibble
The first step in learning the tidy data wrangling is to be familiar with its standard data representation. Before I explain it in detail, let me give you a quick example by printing the iris
dataset:
head(iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
and compare the print output with starwars
dataset from dplyr
:
starwars
#> # A tibble: 87 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Luke~ 172 77 blond fair blue 19 male
#> 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
#> 3 R2-D2 96 32 <NA> white, bl~ red 33 <NA>
#> 4 Dart~ 202 136 none white yellow 41.9 male
#> 5 Leia~ 150 49 brown light brown 19 female
#> 6 Owen~ 178 120 brown, gr~ light blue 52 male
#> 7 Beru~ 165 75 brown light blue 47 female
#> 8 R5-D4 97 32 <NA> white, red red NA <NA>
#> 9 Bigg~ 183 84 black light brown 24 male
#> 10 Obi-~ 182 77 auburn, w~ fair blue-gray 57 male
#> # ... with 77 more rows, and 5 more variables: homeworld <chr>,
#> # species <chr>, films <list>, vehicles <list>, starships <list>
There are sooo many differences between the two outputs that I could elaborate more, but let’s focus on several important points:
- When printing
iris
dataset, I had to call it throughhead()
to avoid printing the whole dataset. This is not the case forstarwars
dataset - Base print output is just as-is, not giving some useful information
- If you see the details in
starwars
’ print output, it is showing that it can store a more complex object, likelist
in this case
Why they behave differently? The answer lies in their class:
class(iris)
#> [1] "data.frame"
class(starwars)
#> [1] "tbl_df" "tbl" "data.frame"
The starwars
dataset class is showing that it’s actually a tibble
; which often referred as tbl
or tbl_*
when related to a specific data class. A tibble
is a little bit different than your usual data.frame
. I already outlined the main differences above, but in additition to that, it should be noted that tibble
is the main data representation across the tidyverse
; and also some extension that adopting the tidy principle, e.g., tbl_ts
from tsibble
for tidy time series data representation.
There are so many other advantages if you use tibble
. For example, if you want to subset just one column, a simple data.frame
will failed you again:
iris[, "Species"]
#> [1] setosa setosa setosa setosa setosa setosa
#> [7] setosa setosa setosa setosa setosa setosa
#> [13] setosa setosa setosa setosa setosa setosa
#> [19] setosa setosa setosa setosa setosa setosa
#> [25] setosa setosa setosa setosa setosa setosa
#> [31] setosa setosa setosa setosa setosa setosa
#> [37] setosa setosa setosa setosa setosa setosa
#> [43] setosa setosa setosa setosa setosa setosa
#> [49] setosa setosa versicolor versicolor versicolor versicolor
#> [55] versicolor versicolor versicolor versicolor versicolor versicolor
#> [61] versicolor versicolor versicolor versicolor versicolor versicolor
#> [67] versicolor versicolor versicolor versicolor versicolor versicolor
#> [73] versicolor versicolor versicolor versicolor versicolor versicolor
#> [79] versicolor versicolor versicolor versicolor versicolor versicolor
#> [85] versicolor versicolor versicolor versicolor versicolor versicolor
#> [91] versicolor versicolor versicolor versicolor versicolor versicolor
#> [97] versicolor versicolor versicolor versicolor virginica virginica
#> [103] virginica virginica virginica virginica virginica virginica
#> [109] virginica virginica virginica virginica virginica virginica
#> [115] virginica virginica virginica virginica virginica virginica
#> [121] virginica virginica virginica virginica virginica virginica
#> [127] virginica virginica virginica virginica virginica virginica
#> [133] virginica virginica virginica virginica virginica virginica
#> [139] virginica virginica virginica virginica virginica virginica
#> [145] virginica virginica virginica virginica virginica virginica
#> Levels: setosa versicolor virginica
It will drop its “table” representation and convert the data into a vector
, instead of a data.frame
with one column. But this is not the case with tibble
:
iris_tbl <- as_tibble(iris)
iris_tbl[, "Species"]
#> # A tibble: 150 x 1
#> Species
#> <fct>
#> 1 setosa
#> 2 setosa
#> 3 setosa
#> 4 setosa
#> 5 setosa
#> 6 setosa
#> 7 setosa
#> 8 setosa
#> 9 setosa
#> 10 setosa
#> # ... with 140 more rows
Why this is justified as a more tidy way? because if you recall the R’s standard way to access a specific part from an object is actually using $
:
iris_tbl$Species
#> [1] setosa setosa setosa setosa setosa setosa
#> [7] setosa setosa setosa setosa setosa setosa
#> [13] setosa setosa setosa setosa setosa setosa
#> [19] setosa setosa setosa setosa setosa setosa
#> [25] setosa setosa setosa setosa setosa setosa
#> [31] setosa setosa setosa setosa setosa setosa
#> [37] setosa setosa setosa setosa setosa setosa
#> [43] setosa setosa setosa setosa setosa setosa
#> [49] setosa setosa versicolor versicolor versicolor versicolor
#> [55] versicolor versicolor versicolor versicolor versicolor versicolor
#> [61] versicolor versicolor versicolor versicolor versicolor versicolor
#> [67] versicolor versicolor versicolor versicolor versicolor versicolor
#> [73] versicolor versicolor versicolor versicolor versicolor versicolor
#> [79] versicolor versicolor versicolor versicolor versicolor versicolor
#> [85] versicolor versicolor versicolor versicolor versicolor versicolor
#> [91] versicolor versicolor versicolor versicolor versicolor versicolor
#> [97] versicolor versicolor versicolor versicolor virginica virginica
#> [103] virginica virginica virginica virginica virginica virginica
#> [109] virginica virginica virginica virginica virginica virginica
#> [115] virginica virginica virginica virginica virginica virginica
#> [121] virginica virginica virginica virginica virginica virginica
#> [127] virginica virginica virginica virginica virginica virginica
#> [133] virginica virginica virginica virginica virginica virginica
#> [139] virginica virginica virginica virginica virginica virginica
#> [145] virginica virginica virginica virginica virginica virginica
#> Levels: setosa versicolor virginica
which gives you the exact results.
So my concluding notes are:
- Use
tibble
format whenever it’s possible. I rarely found any situation that usingtbl
ortbl_*
object is not acceptable, because it is still inheriting the standarddata.frame
class. - When working a specific case, e.g., time series dataset, or big data, always look if there are any tidy representation for the data; see
tsibble
andsparklyr
for example.
Easier data manipulation with dplyr
The dplyr
package is probably the most important part of tidyverse
. It brings so many convenient utilities for data manipulation, particularly for some basic tasks; yet sometimes, could be troublesome if you only use base R functionalities.
Among many advanced features inside dplyr
, I suggest you to start with the cores data manipulation functions:
select()
: for selecting a subset of data by columnfilter()
: for filtering a subset of data by row using conditionarrange()
: for re-arranging the rows ordermutate()
: for mutating (:confused:) a new or existing variables
As implied by my description, most of dplyr
functions could be described by its names; except mutate
, which is very likely to be confusing at the first time.
Those function can be naturally combined with the core aggregation functions:
group_by()
: for specifying that following operations should be done by group(s)summarise()
: for obtaining the summarise of the data by the specified group(s).
These functions probably would cover most of your data wrangling tasks; and if done in a proper tidy way, then you’ll find that data wrangling process will be easier. Generally, most of these workflow could be combined through a pipeline using %>%
(pipe) operator to make it tidier.
Let’s try out these features using starwars
dataset. The first step–and I think the most important step–that we should do is setting our output goal. For this tutorial, let’s try to find out how the height and mass differ across homeworld and gender for human species. So we will transform the dataset from:
starwars
#> # A tibble: 87 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Luke~ 172 77 blond fair blue 19 male
#> 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
#> 3 R2-D2 96 32 <NA> white, bl~ red 33 <NA>
#> 4 Dart~ 202 136 none white yellow 41.9 male
#> 5 Leia~ 150 49 brown light brown 19 female
#> 6 Owen~ 178 120 brown, gr~ light blue 52 male
#> 7 Beru~ 165 75 brown light blue 47 female
#> 8 R5-D4 97 32 <NA> white, red red NA <NA>
#> 9 Bigg~ 183 84 black light brown 24 male
#> 10 Obi-~ 182 77 auburn, w~ fair blue-gray 57 male
#> # ... with 77 more rows, and 5 more variables: homeworld <chr>,
#> # species <chr>, films <list>, vehicles <list>, starships <list>
into this format:
# the output goal:
#> # A tibble: 21 x 5
#> homeworld gender mass height bmi
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Bestine IV male 110 180 61.1
#> 2 Tatooine male 100. 184. 54.0
#> 3 <NA> male 89 193 46.1
#> 4 Tatooine female 75 164 45.5
#> 5 Bespin male 79 175 45.1
#> 6 Naboo male 80 178. 45.0
#> 7 Corellia male 78.5 175 44.9
#> 8 Haruun Kal male 84 188 44.7
#> 9 Socorro male 79 177 44.6
#> 10 Concord Dawn male 79 183 43.2
#> # ... with 11 more rows
Column subsetting using select()
select()
is probably the first function you will call in every data manipulation tasks. By using select()
, you can narrow down the number variables to fewer variables–which are what you truly need.
The select()
usage (and also every core functions in dplyr
) is very straighforward. You just need to pass the dataset and select the column name you want to select:
select(starwars, name)
#> # A tibble: 87 x 1
#> name
#> <chr>
#> 1 Luke Skywalker
#> 2 C-3PO
#> 3 R2-D2
#> 4 Darth Vader
#> 5 Leia Organa
#> 6 Owen Lars
#> 7 Beru Whitesun lars
#> 8 R5-D4
#> 9 Biggs Darklighter
#> 10 Obi-Wan Kenobi
#> # ... with 77 more rows
It’s very easy right? Before we continue to our objective, there are some tips you need to consider.
Chaining operation using %>%
When we doing a complex data manipulation using dplyr
, I suggest to use %>%
operator to make our code tidier:
starwars %>%
select(name)
#> # A tibble: 87 x 1
#> name
#> <chr>
#> 1 Luke Skywalker
#> 2 C-3PO
#> 3 R2-D2
#> 4 Darth Vader
#> 5 Leia Organa
#> 6 Owen Lars
#> 7 Beru Whitesun lars
#> 8 R5-D4
#> 9 Biggs Darklighter
#> 10 Obi-Wan Kenobi
#> # ... with 77 more rows
This is because %>%
operator could help you to make a chaining operation. The pipe operator will pass the results from its left-hand side, to the first parameter in the function of its right-hand side. So you could make a chaining operation like this:
starwars %>%
select(name) %>%
anotherfuns(<arguments>) %>%
anotherfuns(<arguments>) %>%
<continue the chain as long as necessary> %>%
anotherfuns(<arguments>)
So as long as anotherfuns
you chained could accept the previous result as its first argument, you can chaining the operation as long as you want.
Checking the variables using glimpse()
Now back to our objective. Remember that we want to make a summary of height and mass of human species across homeworld and gender. So we actually need only those few variables to go on. Still, remembering the name of those variables could be frustating sometimes, so I suggest to use glimpse()
function before you jump into variable selection:
glimpse(starwars)
#> Observations: 87
#> Variables: 13
#> $ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", ...
#> $ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188...
#> $ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 8...
#> $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "b...
#> $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "l...
#> $ eye_color <chr> "blue", "yellow", "red", "yellow", "brown", "blue",...
#> $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0...
#> $ gender <chr> "male", NA, NA, "male", "female", "male", "female",...
#> $ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alder...
#> $ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Human...
#> $ films <list> [<"Revenge of the Sith", "Return of the Jedi", "Th...
#> $ vehicles <list> [<"Snowspeeder", "Imperial Speeder Bike">, <>, <>,...
#> $ starships <list> [<"X-wing", "Imperial shuttle">, <>, <>, "TIE Adva...
The glimpse()
function is practically the simplified version of str()
; but it only print the most general information like the variable names, class, etc. Another alternative is to use the base colnames()
, but it doesn’t provide additional information like glimpse()
, so I suggest to use the later.
From glimpse()
output, now we have the list of variable names and can be sure that we have to select species
, homeworld
, gender
, mass
, height
:
starwars %>%
select(name, species, homeworld, gender, mass, height)
#> # A tibble: 87 x 6
#> name species homeworld gender mass height
#> <chr> <chr> <chr> <chr> <dbl> <int>
#> 1 Luke Skywalker Human Tatooine male 77 172
#> 2 C-3PO Droid Tatooine <NA> 75 167
#> 3 R2-D2 Droid Naboo <NA> 32 96
#> 4 Darth Vader Human Tatooine male 136 202
#> 5 Leia Organa Human Alderaan female 49 150
#> 6 Owen Lars Human Tatooine male 120 178
#> 7 Beru Whitesun lars Human Tatooine female 75 165
#> 8 R5-D4 Droid Tatooine <NA> 32 97
#> 9 Biggs Darklighter Human Tatooine male 84 183
#> 10 Obi-Wan Kenobi Human Stewjon male 77 182
#> # ... with 77 more rows
Now the dataset looks a lot simpler 😀
Filtering specific observation using filter()
You already see that a simple column selection could help you in starting the data wrangling process. But row/observation selection is sometimes a compulsory step rather than just a helper step. For example, you want to exclude or include specific observations which inheriting some specific traits. If you recall our objective, and to be more specific at “for human species”, then we are absolutely need to filter our dataset to only containing human species.
As aforementioned, filter()
function is also very easy to use: just pass the dataset object as the first argument, then you can pass the condition(s) as the following argument(s).
To declare the condition, the most common way is to use the base condition testing:
==
and!=
for testing equal, and not equal condition>
or>=
for testing greater than, and greater than or equal<
or<=
for testing less than, and less than or equal
Let’s try to filter observation with mass
less than or equal to 50
from starwars
dataset:
filter(starwars, mass <= 50)
#> # A tibble: 13 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 R2-D2 96 32 <NA> white, bl~ red 33 <NA>
#> 2 Leia~ 150 49 brown light brown 19 female
#> 3 R5-D4 97 32 <NA> white, red red NA <NA>
#> 4 Yoda 66 17 white green brown 896 male
#> 5 Wick~ 88 20 brown brown brown 8 male
#> 6 Sebu~ 112 40 none grey, red orange NA male
#> 7 Dud ~ 94 45 none blue, grey yellow NA male
#> 8 Adi ~ 184 50 none dark blue NA female
#> 9 Barr~ 166 50 black yellow blue 40 female
#> 10 Ratt~ 79 15 none grey, blue unknown NA male
#> 11 Wat ~ 193 48 none green, gr~ unknown NA male
#> 12 Sly ~ 178 48 none pale white NA female
#> 13 Padm~ 165 45 brown light brown 46 female
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
It might be looks very simple, but let’s dive deeper on how condition testing works to see its full potential.
Behind the scene of condition testing
Before I explain the details, let’s take a look at the first 5 mass
value from starwars
:
head(starwars$mass)
#> [1] 77 75 32 136 49 120
If we test the condition of <= 50
to above data, we will actually get this output:
head(starwars$mass) <= 50
#> [1] FALSE FALSE TRUE FALSE TRUE FALSE
So basically filtering condition is picking up rows with the TRUE
value; and it will discard the rest. Based on this mechanism, we could also use other helper functions to test a condition, e.g., when we want to see rows with NA
values, we can use is.na()
function:
filter(starwars, is.na(mass))
#> # A tibble: 28 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Wilh~ 180 NA auburn, g~ fair blue 64 male
#> 2 Mon ~ 150 NA auburn fair blue 48 female
#> 3 Arve~ NA NA brown fair brown NA male
#> 4 Fini~ 170 NA blond fair blue 91 male
#> 5 Rugo~ 206 NA none green orange NA male
#> 6 Ric ~ 183 NA brown fair blue NA male
#> 7 Watto 137 NA black blue, grey yellow NA male
#> 8 Quar~ 183 NA black dark brown 62 male
#> 9 Shmi~ 163 NA black fair brown 72 female
#> 10 Bib ~ 180 NA none pale pink NA male
#> # ... with 18 more rows, and 5 more variables: homeworld <chr>,
#> # species <chr>, films <list>, vehicles <list>, starships <list>
or if we want to keep observations with non-NA
value, then we could create a negation using !
:
filter(starwars, !is.na(mass))
#> # A tibble: 59 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Luke~ 172 77 blond fair blue 19 male
#> 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
#> 3 R2-D2 96 32 <NA> white, bl~ red 33 <NA>
#> 4 Dart~ 202 136 none white yellow 41.9 male
#> 5 Leia~ 150 49 brown light brown 19 female
#> 6 Owen~ 178 120 brown, gr~ light blue 52 male
#> 7 Beru~ 165 75 brown light blue 47 female
#> 8 R5-D4 97 32 <NA> white, red red NA <NA>
#> 9 Bigg~ 183 84 black light brown 24 male
#> 10 Obi-~ 182 77 auburn, w~ fair blue-gray 57 male
#> # ... with 49 more rows, and 5 more variables: homeworld <chr>,
#> # species <chr>, films <list>, vehicles <list>, starships <list>
Quick-check on conditions availability
Just like when we check the variables using glimpse()
, we also need to check our variables to make sure about any available options to be filtered. For our case, we already know that we want to get only human observations, but straighly jump into filter()
is very unwise:
starwars %>%
select(name, species, homeworld, gender, mass, height) %>%
filter(species == "human")
#> # A tibble: 0 x 6
#> # ... with 6 variables: name <chr>, species <chr>, homeworld <chr>,
#> # gender <chr>, mass <dbl>, height <int>
It seems like there is no "human"
species, but actually it’s just because we specified a wrong condition. For categorical variable like species
, I really suggest you to check it using unique()
first:
unique(starwars$species)
#> [1] "Human" "Droid" "Wookiee" "Rodian"
#> [5] "Hutt" "Yoda's species" "Trandoshan" "Mon Calamari"
#> [9] "Ewok" "Sullustan" "Neimodian" "Gungan"
#> [13] NA "Toydarian" "Dug" "Zabrak"
#> [17] "Twi'lek" "Vulptereen" "Xexto" "Toong"
#> [21] "Cerean" "Nautolan" "Tholothian" "Iktotchi"
#> [25] "Quermian" "Kel Dor" "Chagrian" "Geonosian"
#> [29] "Mirialan" "Clawdite" "Besalisk" "Kaminoan"
#> [33] "Aleena" "Skakoan" "Muun" "Togruta"
#> [37] "Kaleesh" "Pau'an"
Which shows that it’s actually "Human"
. For numerical variables, you can use summary()
to help you quick-check the value range:
summary(starwars$height)
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> 66.0 167.0 180.0 174.4 191.0 264.0 6
After we quick-check the values within our variable of interest, it’s safe now to call filter()
function:
starwars %>%
select(name, species, homeworld, gender, mass, height) %>%
filter(species == "Human")
#> # A tibble: 35 x 6
#> name species homeworld gender mass height
#> <chr> <chr> <chr> <chr> <dbl> <int>
#> 1 Luke Skywalker Human Tatooine male 77 172
#> 2 Darth Vader Human Tatooine male 136 202
#> 3 Leia Organa Human Alderaan female 49 150
#> 4 Owen Lars Human Tatooine male 120 178
#> 5 Beru Whitesun lars Human Tatooine female 75 165
#> 6 Biggs Darklighter Human Tatooine male 84 183
#> 7 Obi-Wan Kenobi Human Stewjon male 77 182
#> 8 Anakin Skywalker Human Tatooine male 84 188
#> 9 Wilhuff Tarkin Human Eriadu male NA 180
#> 10 Han Solo Human Corellia male 80 180
#> # ... with 25 more rows
Manipulating variables using mutate()
When I mentioned mutate()
terms, most of you that are not familiar with dplyr
probably not sure on its meaning. This function is basically helping us to manipulating the variables inside our dataframe. For example, you could change the units of mass
and height
index:
mutate(starwars, mass = mass * 100, height = height / 100)
#> # A tibble: 87 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Luke~ 1.72 7700 blond fair blue 19 male
#> 2 C-3PO 1.67 7500 <NA> gold yellow 112 <NA>
#> 3 R2-D2 0.96 3200 <NA> white, bl~ red 33 <NA>
#> 4 Dart~ 2.02 13600 none white yellow 41.9 male
#> 5 Leia~ 1.5 4900 brown light brown 19 female
#> 6 Owen~ 1.78 12000 brown, gr~ light blue 52 male
#> 7 Beru~ 1.65 7500 brown light blue 47 female
#> 8 R5-D4 0.97 3200 <NA> white, red red NA <NA>
#> 9 Bigg~ 1.83 8400 black light brown 24 male
#> 10 Obi-~ 1.82 7700 auburn, w~ fair blue-gray 57 male
#> # ... with 77 more rows, and 5 more variables: homeworld <chr>,
#> # species <chr>, films <list>, vehicles <list>, starships <list>
But it doesn’t limited to manipulating the existing variables; you could also make a new variable using this function. Say, we want to add Body Mass Index (BMI) to our current analysis, we could create it using mutate()
:
starwars %>%
select(name, species, homeworld, gender, mass, height) %>%
filter(species == "Human") %>%
mutate(bmi = mass / height * 100)
#> # A tibble: 35 x 7
#> name species homeworld gender mass height bmi
#> <chr> <chr> <chr> <chr> <dbl> <int> <dbl>
#> 1 Luke Skywalker Human Tatooine male 77 172 44.8
#> 2 Darth Vader Human Tatooine male 136 202 67.3
#> 3 Leia Organa Human Alderaan female 49 150 32.7
#> 4 Owen Lars Human Tatooine male 120 178 67.4
#> 5 Beru Whitesun lars Human Tatooine female 75 165 45.5
#> 6 Biggs Darklighter Human Tatooine male 84 183 45.9
#> 7 Obi-Wan Kenobi Human Stewjon male 77 182 42.3
#> 8 Anakin Skywalker Human Tatooine male 84 188 44.7
#> 9 Wilhuff Tarkin Human Eriadu male NA 180 NA
#> 10 Han Solo Human Corellia male 80 180 44.4
#> # ... with 25 more rows
This function is very versatile, so play around with your creativity here :sunglasses:
Data aggregation using group_by
and summarise
We are reaching the final parts! Let’s recall the part of our objective that we haven’t done yet: “across the homeworld and gender”. This particular phrase is indicating that we want each of our observation to representing its homeworld
and gender
instead of each character’s name
like what we currently have. We can achieve this using group_by()
and summarise()
function.
To declare a group, you just need to specify the name of variables that you want to set as group identifier:
starwars %>%
select(name, species, homeworld, gender, mass, height) %>%
filter(species == "Human") %>%
mutate(bmi = mass / height * 100) %>%
group_by(homeworld, gender)
#> # A tibble: 35 x 7
#> # Groups: homeworld, gender [21]
#> name species homeworld gender mass height bmi
#> <chr> <chr> <chr> <chr> <dbl> <int> <dbl>
#> 1 Luke Skywalker Human Tatooine male 77 172 44.8
#> 2 Darth Vader Human Tatooine male 136 202 67.3
#> 3 Leia Organa Human Alderaan female 49 150 32.7
#> 4 Owen Lars Human Tatooine male 120 178 67.4
#> 5 Beru Whitesun lars Human Tatooine female 75 165 45.5
#> 6 Biggs Darklighter Human Tatooine male 84 183 45.9
#> 7 Obi-Wan Kenobi Human Stewjon male 77 182 42.3
#> 8 Anakin Skywalker Human Tatooine male 84 188 44.7
#> 9 Wilhuff Tarkin Human Eriadu male NA 180 NA
#> 10 Han Solo Human Corellia male 80 180 44.4
#> # ... with 25 more rows
Then the tibble
will automatically give you information about which group is active.
Note that the order of specified groups is also matters. The function will process the group operation from the right-most variable. So the best practice here to set the order from major to minor group:
... %>%
group_by(<major group>, <minor group>, <definitely the minorest group>) %>%
...
After we set the group, we can use summarise()
to do the aggregation. This function is almost identic to how mutate works, but the output will be aggregated value by the group we specified; note that if we don’t specify any group, then it will aggregate the value to one observation that representing our dataset.
Let’s try to take the mean
of height
, mass
, and bmi
variables:
starwars %>%
select(name, species, homeworld, gender, mass, height) %>%
filter(species == "Human") %>%
mutate(bmi = mass / height * 100) %>%
group_by(homeworld, gender) %>%
summarise(
mass = mean(mass, na.rm = TRUE), # use na.rm since there are
height = mean(height, na.rm = TRUE), # some missing values
bmi = mean(bmi, na.rm = TRUE)
)
#> # A tibble: 21 x 5
#> # Groups: homeworld [16]
#> homeworld gender mass height bmi
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Alderaan female 49 150 32.7
#> 2 Alderaan male 79 190. 42.0
#> 3 Bespin male 79 175 45.1
#> 4 Bestine IV male 110 180 61.1
#> 5 Chandrila female NaN 150 NaN
#> 6 Concord Dawn male 79 183 43.2
#> 7 Corellia male 78.5 175 44.9
#> 8 Coruscant female NaN 167 NaN
#> 9 Coruscant male NaN 170 NaN
#> 10 Eriadu male NaN 180 NaN
#> # ... with 11 more rows
When we done with group operations, don’t forget to set the group declaration to off. The most straighforward ways to do this is by passing the result to ungroup()
:
starwars %>%
select(name, species, homeworld, gender, mass, height) %>%
filter(species == "Human") %>%
mutate(bmi = mass / height * 100) %>%
group_by(homeworld, gender) %>%
summarise(
mass = mean(mass, na.rm = TRUE),
height = mean(height, na.rm = TRUE),
bmi = mean(bmi, na.rm = TRUE)
) %>%
ungroup()
#> # A tibble: 21 x 5
#> homeworld gender mass height bmi
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Alderaan female 49 150 32.7
#> 2 Alderaan male 79 190. 42.0
#> 3 Bespin male 79 175 45.1
#> 4 Bestine IV male 110 180 61.1
#> 5 Chandrila female NaN 150 NaN
#> 6 Concord Dawn male 79 183 43.2
#> 7 Corellia male 78.5 175 44.9
#> 8 Coruscant female NaN 167 NaN
#> 9 Coruscant male NaN 170 NaN
#> 10 Eriadu male NaN 180 NaN
#> # ... with 11 more rows
Re-ordering observation using arrange()
Our last results actually enough for showing the summary of mass
and height
across homeworld
and gender
. But it doesn’t give us some basic informations, like, which homeworld-gender combination has the lowest or higher bmi
.
Fortunately, dplyr
provided arrange()
function for this specific task. By default, this function will arrange the specified variable in ascending mode:
starwars %>%
select(name, species, homeworld, gender, mass, height) %>%
filter(species == "Human") %>%
mutate(bmi = mass / height * 100) %>%
group_by(homeworld, gender) %>%
summarise(
mass = mean(mass, na.rm = TRUE),
height = mean(height, na.rm = TRUE),
bmi = mean(bmi, na.rm = TRUE)
) %>%
ungroup() %>%
arrange(bmi)
#> # A tibble: 21 x 5
#> homeworld gender mass height bmi
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Naboo female 45 162. 27.3
#> 2 Alderaan female 49 150 32.7
#> 3 Serenno male 80 193 41.5
#> 4 Alderaan male 79 190. 42.0
#> 5 Stewjon male 77 182 42.3
#> 6 Kamino male 78.2 183 42.7
#> 7 Concord Dawn male 79 183 43.2
#> 8 Socorro male 79 177 44.6
#> 9 Haruun Kal male 84 188 44.7
#> 10 Corellia male 78.5 175 44.9
#> # ... with 11 more rows
And if you want descending sorting, you just need to wrap the variable using desc()
:
starwars %>%
select(name, species, homeworld, gender, mass, height) %>%
filter(species == "Human") %>%
mutate(bmi = mass / height * 100) %>%
group_by(homeworld, gender) %>%
summarise(
mass = mean(mass, na.rm = TRUE),
height = mean(height, na.rm = TRUE),
bmi = mean(bmi, na.rm = TRUE)
) %>%
ungroup() %>%
arrange(desc(bmi))
#> # A tibble: 21 x 5
#> homeworld gender mass height bmi
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Bestine IV male 110 180 61.1
#> 2 Tatooine male 100. 184. 54.0
#> 3 <NA> male 89 193 46.1
#> 4 Tatooine female 75 164 45.5
#> 5 Bespin male 79 175 45.1
#> 6 Naboo male 80 178. 45.0
#> 7 Corellia male 78.5 175 44.9
#> 8 Haruun Kal male 84 188 44.7
#> 9 Socorro male 79 177 44.6
#> 10 Concord Dawn male 79 183 43.2
#> # ... with 11 more rows
Now we know that the females from Naboo have the lowest mean BMI, while the males from Bestine have the highest mean BMI 😀
Concluding remarks
In this article, you already see how easy–and tidy!–it is to do data wrangling tasks using tibble
and dplyr
. But we just scratched the surface of tidyverse
; there are so many package and functionalities available for us to explore!
For the next episode, we will try more realistic cases, and see how tidyverse
packages could help us in a more complex data wrangling tasks. So stay tuned on this series! :sunglasses:
Here is a tidyverse
logo for you:
#> * __ _ __ . o * .
#> / /_(_)__/ /_ ___ _____ _______ ___
#> / __/ / _ / // / |/ / -_) __(_-</ -_)
#> \__/_/\_,_/\_, /|___/\__/_/ /___/\__/
#> * . /___/ o . *