The dplyr package has become a preferred tool for data manipulation for many R users. There are two innovative ideas implemented by the dplyr package that are making their way into other packages.
Small composable functions (verbs)
dplyr provides a handful of verbs that serve as a language of data manipulation. Each verb is a function that does one thing to a dataset and does it very well. Examples are select, filter, mutate, group_by, and summarise. Other packages being developed with the tidyverse philosophy are adopting this strategy. For example, the tidygraph package provides verbs similar to dplyr's for working with graph data, and the plyranges package provides verbs for working with genomics data. The underlying object is different in each case, but the style of manipulating objects by composing verbs with pipes is common to all of these packages.
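To make the composability concrete, here is a minimal sketch using dplyr verbs on the built-in mtcars dataset (the pipeline is just an illustration, unrelated to the cleaning problem below):
library(dplyr)
mtcars %>%
  filter(hp > 100) %>%             # keep rows: one job
  group_by(cyl) %>%                # declare groups: one job
  summarise(avg_mpg = mean(mpg))   # collapse groups: one job
Each verb does exactly one thing, and the pipe composes them into a readable sentence of data manipulation.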
Abstracting away the back end and delaying computation
One of the more fantastic things about dplyr is that it abstracts away where the data is stored. Your data can be a data frame in memory, a table in pretty much any SQL database, or even an object in Apache Spark. In all cases the dplyr code is the same. Furthermore, dplyr delays computation until the last possible moment, so running dplyr code on a database amounts to building up a recipe for the results you want. The recipe is not executed until the user explicitly asks for the result. For computations on large datasets this can be a really nice feature. I think the in-development hailr package for parallelized analysis of genomics data will implement this idea as well.
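As a quick sketch of what delayed computation looks like in practice (assuming a DBI connection named con and a table named df, which is exactly the setup we create below):
lazy <- tbl(con, "df") %>%
  filter(id == 1)         # no query has run yet
show_query(lazy)          # inspect the SQL dplyr has built up so far
result <- collect(lazy)   # only collect() actually executes the query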
In this post we will take a trip down the rabbit hole of extending dplyr by creating a function we can use in a pipeline to solve a data cleaning problem. In the process we will explore some advanced programming concepts in R and learn some powerful techniques for extending dplyr.
Problem statement
Suppose you have a table in a database containing information about some type of entity, but it is a bit messy. Each entity has a unique id and some associated attributes, all stored in a single table. The trouble is that this data is not normalized. By that I mean that while each entity should have only one value for each attribute, it often happens that there are multiple values in the data, spread across multiple rows. Furthermore, there is a lot of missing data in the form of NAs or empty strings. Our goal is to normalize the data so that the attribute columns have a consistent value within each id. We also want to keep the data in the database since it is very large.
Let’s create some fake data to play with and load it into a postgres database.
library(dplyr)
df <- tibble::tribble(
  ~id, ~letter, ~number,
  1, "a", NA,
  1, "a", "1",
  1, "", "1",
  1, "", NA,
  1, "b", NA,
  2, "c", "9",
  2, NA, NA)
library(RPostgreSQL)
con <- DBI::dbConnect(DBI::dbDriver("PostgreSQL"),
                      dbname = "postgres",
                      host = "localhost",
                      port = 5432,
                      user = "postgres",
                      password = keyring::key_get(service = "postgres", username = "postgres"))
db <- copy_to(con, df)
db
## # Source: table<df> [?? x 3]
## # Database: postgres 10.0.5 [postgres@localhost:5432/postgres]
## id letter number
## <dbl> <chr> <chr>
## 1 1 a <NA>
## 2 1 a 1
## 3 1 "" 1
## 4 1 "" <NA>
## 5 1 b <NA>
## 6 2 c 9
## 7 2 <NA> <NA>
This data contains two entity ids and two attributes. As you can see, the data is messy. Let's assume that each entity should be associated with one and only one value for each attribute. There are multiple ways we could normalize this data, but we would like to write a function that assigns the most frequently occurring non-missing (i.e. not blank or NA) value of each attribute to each entity. This is basically a mode function applied to each column within id. Mode is generally not available in SQL databases, and it is not actually well defined since there may be ties. We will apply the "Don't let the perfect be the enemy of the good" rule and ignore the problem of ties for now. Furthermore, we don't want to actually collapse the data, since there may be more manipulations we want to perform and some other variables that we do not want to summarize.
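To pin down what mode means here, a minimal in-memory sketch might look like the following (stat_mode is a hypothetical helper; since our data lives in the database we will not use this directly):
# most frequent value, ignoring NAs and blanks; ties broken arbitrarily
stat_mode <- function(x) {
  x <- x[!is.na(x) & x != ""]
  names(sort(table(x), decreasing = TRUE))[1]
}
stat_mode(c("a", "a", "", "b", NA))
## [1] "a"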
The desired solution looks like this.
source("variable_consistency.R")
db %>%
  make_consistent(across = "id", letter, number)
## # Source: lazy query [?? x 3]
## # Database: postgres 10.0.5 [postgres@localhost:5432/postgres]
## # Groups: id
## id letter number
## <dbl> <chr> <chr>
## 1 1 a 1
## 2 1 a 1
## 3 1 a 1
## 4 1 a 1
## 5 1 a 1
## 6 2 c 9
## 7 2 c 9
The basic strategy
The basic strategy behind this function is to define a temp variable for each column we want to clean up. This temp variable will contain the count of occurrences of each non-missing value within id. Missing values (blanks and NAs) always get a count of zero. The dplyr code looks like this.
db2 <- db %>%
  group_by(id, letter) %>%
  mutate(letter_count = ifelse(is.na(letter) | letter == "", 0, n())) %>%
  group_by(id, number) %>%
  mutate(number_count = ifelse(is.na(number) | number == "", 0, n())) %>%
  group_by(id) %>%
  mutate(new_letter = first(letter, order_by = desc(letter_count)),
         new_number = first(number, order_by = desc(number_count)))
db2
## # Source: lazy query [?? x 7]
## # Database: postgres 10.0.5 [postgres@localhost:5432/postgres]
## # Groups: id
## id letter number letter_count number_count new_letter new_number
## <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 1 a 1 2 2 a 1
## 2 1 "" 1 0 2 a 1
## 3 1 b <NA> 1 0 a 1
## 4 1 "" <NA> 0 0 a 1
## 5 1 a <NA> 2 0 a 1
## 6 2 c 9 1 1 c 9
## 7 2 <NA> <NA> 0 0 c 9
db2 actually contains the recipe for how to get the result. Since the data is in a postgres database, this recipe is postgres SQL code.
show_query(db2)
## <SQL>
## SELECT "id", "letter", "number", "letter_count", "number_count", first_value("letter") OVER (PARTITION BY "id" ORDER BY "letter_count" DESC) AS "new_letter", first_value("number") OVER (PARTITION BY "id" ORDER BY "number_count" DESC) AS "new_number"
## FROM (SELECT "id", "letter", "number", "letter_count", CASE WHEN ((("number") IS NULL) OR "number" = '') THEN (0.0) WHEN NOT((("number") IS NULL) OR "number" = '') THEN (COUNT(*) OVER (PARTITION BY "id", "number")) END AS "number_count"
## FROM (SELECT "id", "letter", "number", CASE WHEN ((("letter") IS NULL) OR "letter" = '') THEN (0.0) WHEN NOT((("letter") IS NULL) OR "letter" = '') THEN (COUNT(*) OVER (PARTITION BY "id", "letter")) END AS "letter_count"
## FROM "df") "inetcctmyi") "kxzdarxqph"
To get our desired result we just need to drop the two count columns and overwrite the letter and number columns with new_letter and new_number respectively, as sketched below. However, this only solves our problem in this specific case, and we would like a function that we can throw into any dplyr pipeline.
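For this particular table, the one-off cleanup would be something like this sketch:
db2 %>%
  mutate(letter = new_letter, number = new_number) %>%  # overwrite with the consistent values
  select(id, letter, number)                            # drop the helper columns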
Concepts
To encapsulate this logic in a function we will need to introduce a few R programming concepts. The first is programming with dplyr, which involves the idea of unquoting. If you are new to this idea you should read the vignette.
Non-standard evaluation
To introduce the idea, let's look at glue. It is a function that creates strings, similar to paste in base R.
name <- "Adam"
glue::glue("Hi my name is {name}")
## Hi my name is Adam
The characters in the curly braces are interpreted differently than the other characters. The braces tell the glue function to unquote (i.e. evaluate) whatever is inside.
glue::glue("The square root of pi is approximately {sqrt(pi)}")
## The square root of pi is approximately 1.77245385090552
We can build R expressions in a similar fashion.
four <- 4
ten <- rlang::expr(1 + 2 + 3 + !!four)
ten
## 1 + 2 + 3 + 4
The !! symbol in the expr function is analogous to the {} in the glue function.
Then we can manually evaluate an expression.
rlang::eval_tidy(ten)
## [1] 10
In R you can write functions that quote their arguments, capturing the expression and then evaluating it in a non-standard way. This gives R functions the ability to handle their arguments in any way they like.
For example, consider the following code.
x <- 1:4
plot(x, x^2)
Notice the y axis is labeled x^2, which is the expression I passed in. The plot function quoted that input and used it both to calculate the y variable and to label the y axis. Pretty slick.
dplyr uses this technique heavily to evaluate arguments like variable names in the context of the associated dataframe, which is generally passed in as the first argument of a verb function.
This print_hello function quotes its input, similar to dplyr functions.
print_hello <- function(name){
  name <- rlang::enexpr(name)
  glue::glue("Hello {name}")
}
print_hello(Dave)
## Hello Dave
Functions that quote their input are difficult to program with. The !! unquoting operator helps us get around the difficulties.
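To see the difficulty, consider naively wrapping print_hello in another function (greet is a contrived helper for illustration):
greet <- function(who){
  print_hello(who)   # print_hello captures the literal symbol who, not its value
}
greet(Dave)
## Hello who
Because print_hello quotes its argument, the wrapper passes along the symbol who instead of Dave. Quoting the argument ourselves and unquoting it with !! fixes the problem:
greet <- function(who){
  who <- rlang::enexpr(who)
  print_hello(!!who)   # unquote so print_hello sees Dave, not who
}
greet(Dave)
## Hello Dave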
Let's use !! to wrap our initial solution in a function.
make_consistent <- function(df, var){
  var <- rlang::enquo(var)
  df %>%  # use the df argument, not the global db
    group_by(id, !!var) %>%
    mutate(tmp_count = ifelse(is.na(!!var) | !!var == "", 0, n())) %>%
    ungroup() %>%
    group_by(id) %>%
    mutate(!!rlang::quo_name(var) := first(!!var, order_by = desc(tmp_count))) %>%
    select(-tmp_count)
}
db %>%
  make_consistent(letter)
## # Source: lazy query [?? x 3]
## # Database: postgres 10.0.5 [postgres@localhost:5432/postgres]
## # Groups: id
## id letter number
## <dbl> <chr> <chr>
## 1 1 a 1
## 2 1 a <NA>
## 3 1 a <NA>
## 4 1 a <NA>
## 5 1 a 1
## 6 2 c 9
## 7 2 c <NA>
Not bad.
Sidebar: I will punt the discussion of quosures vs. expressions and enquo vs. enexpr for now. The short explanation is that an expression is unevaluated R code, and a quosure is an R expression that also contains information about the environment it should be evaluated in. enexpr can be used in a function to quote the input and create an expression, while enquo does the same thing but creates a quosure instead. I think the rule of thumb is to generally prefer quosures over expressions, but if your code works under your various test conditions then you might not need to worry about it too much. End Sidebar.
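A tiny sketch of the difference, using hypothetical helpers:
make_quo <- function(){
  secret <- 42
  rlang::quo(secret)    # a quosure: the code plus this function's environment
}
rlang::eval_tidy(make_quo())
## [1] 42
make_expr <- function(){
  secret <- 42
  rlang::expr(secret)   # an expression: just the bare code
}
# rlang::eval_tidy(make_expr()) errors with "object 'secret' not found"
# because the expression carries no environment with it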
We successfully encapsulated our variable cleaning logic in a function. However, this function only fixes one variable at a time. To handle multiple variables simultaneously we need a couple more concepts.
Using ...
The ... argument (pronounced dot-dot-dot) is a handy way to let the user pass any number of arguments into your function. Any arguments that are not matched by name or position will match the ... argument.
Let's start exploring by printing out the structure of the ... argument.
dot_fun <- function(...){
  str(list(...))
}
dot_fun("hi", "there", "everyone")
## List of 3
## $ : chr "hi"
## $ : chr "there"
## $ : chr "everyone"
We can also quote the extra arguments and capture them in a list.
dot_fun <- function(...){
  str(rlang::exprs(...))
}
dot_fun(hi, there, everyone)
## List of 3
## $ : symbol hi
## $ : symbol there
## $ : symbol everyone
Build a list with the pipe
The pipe makes code more readable and can be used to build a list.
list("a") %>% c("b") %>% c("c")
## [[1]]
## [1] "a"
##
## [[2]]
## [1] "b"
##
## [[3]]
## [1] "c"
Lists can contain anything in R, including steps of a dplyr pipeline.
list(rlang::expr(db)) %>%
  c(rlang::expr(group_by(id))) %>%
  c(rlang::expr(mutate(letter = max(letter))))
## [[1]]
## db
##
## [[2]]
## group_by(id)
##
## [[3]]
## mutate(letter = max(letter))
We can use the reduce function from the purrr package to collapse our list of dplyr steps into a single dplyr expression.
dplyr_expr <- list(rlang::expr(df)) %>%
  c(rlang::expr(group_by(id))) %>%
  c(rlang::expr(mutate(letter = max(letter)))) %>%
  purrr::reduce(function(a, b){rlang::expr(!!a %>% !!b)})
dplyr_expr
## df %>% group_by(id) %>% mutate(letter = max(letter))
Then we can evaluate the dplyr expression.
rlang::eval_tidy(dplyr_expr)
## # A tibble: 7 x 3
## # Groups: id [2]
## id letter number
## <dbl> <chr> <chr>
## 1 1 b <NA>
## 2 1 b 1
## 3 1 b 1
## 4 1 b <NA>
## 5 1 b <NA>
## 6 2 <NA> 9
## 7 2 <NA> <NA>
Putting these ideas together
Let’s create a function that changes each selected variable to an “a”.
make_a <- function(df, ...){
  # save arguments as a quoted list
  dots <- rlang::quos(...)
  # create replacement expressions assigning the letter a to each variable
  replacement_expr <- purrr::map(dots, ~rlang::expr(!!rlang::quo_name(.x) := "a"))
  print(replacement_expr)
  # create a dplyr pipeline, splicing in the replacement expressions
  final_expr <- list(rlang::expr(df)) %>%
    c(rlang::expr(group_by(id))) %>%
    c(rlang::expr(mutate(!!!replacement_expr))) %>%
    purrr::reduce(function(a, b) rlang::expr(!!a %>% !!b))
  print(final_expr)
  # return the evaluated dplyr pipeline
  rlang::eval_tidy(final_expr)
}
make_a(db, letter, number)
## [[1]]
## `:=`("letter", "a")
##
## [[2]]
## `:=`("number", "a")
##
## df %>% group_by(id) %>% mutate(`:=`("letter", "a"), `:=`("number",
## "a"))
## # Source: lazy query [?? x 3]
## # Database: postgres 10.0.5 [postgres@localhost:5432/postgres]
## # Groups: id
## id letter number
## <dbl> <chr> <chr>
## 1 1 a a
## 2 1 a a
## 3 1 a a
## 4 1 a a
## 5 1 a a
## 6 2 a a
## 7 2 a a
Now, instead of setting each variable to "a", we will set each to the most commonly occurring non-missing value within each id.
make_consistent <- function(df, ...){
  # quote variables and save as a list
  dots <- rlang::quos(...)
  # we need a temporary counter variable for each variable in our data
  # we will assume there are no existing variables in the database table starting with underscore
  tmp <- paste0("_tmp_", 1:length(dots))
  print(tmp)
  # create a list of group_by/mutate steps that add the counter variables
  counter_expr <- purrr::map2(dots, tmp, ~list(
    rlang::expr(group_by(id, !!.x)),
    rlang::expr(mutate(!!.y := ifelse(is.na(!!.x) | !!.x == "", 0, n())))
  ))
  print(counter_expr)
  # now create the replacement expressions that will assign the most frequent value to each variable
  replacement_expr <- purrr::map2(dots, tmp,
    ~rlang::expr(!!rlang::quo_name(.x) := first(!!.x, order_by = desc(!!as.name(.y)))))
  print(replacement_expr)
  # put it all together using the dplyr list approach
  final_expr <- list(rlang::expr(df)) %>%
    c(counter_expr) %>%
    c(rlang::expr(ungroup())) %>%
    c(rlang::expr(group_by(id))) %>%
    c(rlang::expr(mutate(!!!replacement_expr))) %>%
    c(rlang::expr(select(-starts_with("_tmp_")))) %>%
    rlang::flatten() %>% # because we have nested lists we need flatten to create a flat list
    purrr::reduce(function(a, b) rlang::expr(!!a %>% !!b))
  print(final_expr)
  # finally return the evaluated expression
  rlang::eval_tidy(final_expr)
}
db2 <- db %>%
  make_consistent(letter, number)
## [1] "_tmp_1" "_tmp_2"
## [[1]]
## [[1]][[1]]
## group_by(id, ~letter)
##
## [[1]][[2]]
## mutate(`:=`("_tmp_1", ifelse(is.na(~letter) | (~letter) == "",
## 0, n())))
##
##
## [[2]]
## [[2]][[1]]
## group_by(id, ~number)
##
## [[2]][[2]]
## mutate(`:=`("_tmp_2", ifelse(is.na(~number) | (~number) == "",
## 0, n())))
##
##
## [[1]]
## `:=`("letter", first(~letter, order_by = desc(`_tmp_1`)))
##
## [[2]]
## `:=`("number", first(~number, order_by = desc(`_tmp_2`)))
##
## df %>% group_by(id, ~letter) %>% mutate(`:=`("_tmp_1", ifelse(is.na(~letter) |
## (~letter) == "", 0, n()))) %>% group_by(id, ~number) %>%
## mutate(`:=`("_tmp_2", ifelse(is.na(~number) | (~number) ==
## "", 0, n()))) %>% ungroup() %>% group_by(id) %>% mutate(`:=`("letter",
## first(~letter, order_by = desc(`_tmp_1`))), `:=`("number",
## first(~number, order_by = desc(`_tmp_2`)))) %>% select(-starts_with("_tmp_"))
db2
## # Source: lazy query [?? x 3]
## # Database: postgres 10.0.5 [postgres@localhost:5432/postgres]
## # Groups: id
## id letter number
## <dbl> <chr> <chr>
## 1 1 a 1
## 2 1 a 1
## 3 1 a 1
## 4 1 a 1
## 5 1 a 1
## 6 2 c 9
## 7 2 c 9
Let's take a look at all the SQL code produced by this one R function.
show_query(db2)
## <SQL>
## SELECT "id", "letter", "number"
## FROM (SELECT "id", first_value("letter") OVER (PARTITION BY "id" ORDER BY "_tmp_1" DESC) AS "letter", first_value("number") OVER (PARTITION BY "id" ORDER BY "_tmp_2" DESC) AS "number", "_tmp_1", "_tmp_2"
## FROM (SELECT "id", "letter", "number", "_tmp_1", CASE WHEN ((("number") IS NULL) OR "number" = '') THEN (0.0) WHEN NOT((("number") IS NULL) OR "number" = '') THEN (COUNT(*) OVER (PARTITION BY "id", "number")) END AS "_tmp_2"
## FROM (SELECT "id", "letter", "number", CASE WHEN ((("letter") IS NULL) OR "letter" = '') THEN (0.0) WHEN NOT((("letter") IS NULL) OR "letter" = '') THEN (COUNT(*) OVER (PARTITION BY "id", "letter")) END AS "_tmp_1"
## FROM "df") "wbyujqflur") "pagajkfaun") "dyekutmwyi"
One small change left
We hard coded the name of the id variable. It would be better to allow the user to specify the name of the id column. I'm going to ask for a character string rather than a bare variable name to distinguish this special variable from the rest.
make_consistent <- function(df, across, ...){
  # allow the user to specify the name of the id variable as a character string
  stopifnot(is.character(across))
  across <- as.name(across)
  # quote variables and save as a list
  dots <- rlang::quos(...)
  # we need a temporary counter variable for each variable in our data
  # we will assume there are no existing variables in the database table starting with underscore
  tmp <- paste0("_tmp_", 1:length(dots))
  # create a list of group_by/mutate steps that add the counter variables
  counter_expr <- purrr::map2(dots, tmp, ~list(
    rlang::expr(group_by(!!across, !!.x)),
    rlang::expr(mutate(!!.y := ifelse(is.na(!!.x) | !!.x == "", 0, n())))
  ))
  # now create the replacement expressions that will assign the most frequent value to each variable
  replacement_expr <- purrr::map2(dots, tmp,
    ~rlang::expr(!!rlang::quo_name(.x) := first(!!.x, order_by = desc(!!as.name(.y)))))
  # put it all together using the dplyr list approach
  final_expr <- list(rlang::expr(df)) %>%
    c(counter_expr) %>%
    c(rlang::expr(ungroup())) %>%
    c(rlang::expr(group_by(!!across))) %>%
    c(rlang::expr(mutate(!!!replacement_expr))) %>%
    c(rlang::expr(select(-starts_with("_tmp_")))) %>%
    rlang::flatten() %>%
    purrr::reduce(function(a, b) rlang::expr(!!a %>% !!b))
  # finally return the evaluated expression
  rlang::eval_tidy(final_expr)
}
Whew, that was a lot of work and kind of a mind bender! However, now we never need to think about this particular problem again. Whenever I encounter a situation where I need to make one variable consistent across levels of another based on the mode function, I just add one line to my dplyr pipeline.
db %>%
  mutate(new_var = "new_var") %>%
  make_consistent(across = "id", letter, number, new_var) %>%
  distinct()
## # Source: lazy query [?? x 4]
## # Database: postgres 10.0.5 [postgres@localhost:5432/postgres]
## # Groups: id
## id letter number new_var
## <dbl> <chr> <chr> <chr>
## 1 2 c 9 new_var
## 2 1 a 1 new_var
Summary
dplyr has been successful because it is easy to use and implements a few powerful ideas:
- Small composable verb functions
- Abstraction of back-end object
- Delayed computation
Being able to extend dplyr with your own verb functions that encapsulate commonly used data manipulations is powerful because it allows you to solve a problem once and then apply that solution as a single line of code in the future. Writing functions is the first level of this type of logic encapsulation. Creating packages is the next level up, but that is a topic for another meetup.
To learn more about these and other cool R programming ideas, check out the book Advanced R by Hadley Wickham.