library(tidyverse) library(glue) # similar to paste() but more powerful
All of the ideas in this talk come from other people, primarily the R for Data Science iteration chapter.
library(tidyverse) library(glue) # similar to paste() but more powerful
All of the ideas in this talk come from other people, primarily the R for Data Science iteration chapter.
For example you can define a new ifelse function that treats NA like any other value.
ifelse2 <- function(test, yes, no) ifelse(!is.na(test) & test, yes, no) tibble(age = c(0, NA, 5, 15, 21, 35)) %>% mutate(ifelse = ifelse(age < 18, "Minor", "Adult"), # Base R version ifelse2 = ifelse2(age < 18, "Minor", "Adult"), # My custom version if_else = if_else(age < 18, "Minor", "Adult")) # Tidyverse version
## # A tibble: 6 x 4 ## age ifelse ifelse2 if_else ## <dbl> <chr> <chr> <chr> ## 1 0 Minor Minor Minor ## 2 NA <NA> Adult <NA> ## 3 5 Minor Minor Minor ## 4 15 Minor Minor Minor ## 5 21 Adult Adult Adult ## 6 35 Adult Adult Adult
if_else has an argument for explicit NA handling
tibble(age = c(0, NA, 5, 15, 21, 35)) %>% mutate(ifelse = ifelse(age < 18, "Minor", "Adult"), ifelse2 = ifelse2(age < 18, "Minor", "Adult"), if_else = if_else(age < 18, "Minor", "Adult", missing = "We don't know"))
## # A tibble: 6 x 4 ## age ifelse ifelse2 if_else ## <dbl> <chr> <chr> <chr> ## 1 0 Minor Minor Minor ## 2 NA <NA> Adult We don't know ## 3 5 Minor Minor Minor ## 4 15 Minor Minor Minor ## 5 21 Adult Adult Adult ## 6 35 Adult Adult Adult
tibble(age = c(0, NA, 5, 15, 21, 35)) %>% mutate(case_when = case_when( age < 1 ~ "Baby", age < 18 ~ "Minor", T ~ "Adult"))
## # A tibble: 6 x 2 ## age case_when ## <dbl> <chr> ## 1 0 Baby ## 2 NA Adult ## 3 5 Minor ## 4 15 Minor ## 5 21 Adult ## 6 35 Adult
tibble(age = c(0, NA, 5, 15, 21, 35)) %>% mutate(case_when = case_when( age < 1 ~ "Baby", age < 18 ~ "Minor", is.na(age) ~ "We don't know", T ~ "Adult"))
## # A tibble: 6 x 2 ## age case_when ## <dbl> <chr> ## 1 0 Baby ## 2 NA We don't know ## 3 5 Minor ## 4 15 Minor ## 5 21 Adult ## 6 35 Adult
ifelse(1:6 > 3, ">3", c("less", "than 3"))
## [1] "less" "than 3" "less" ">3" ">3" ">3"
ifelse(1:6 > 3, ">3", 4)
## [1] "4" "4" "4" ">3" ">3" ">3"
if_else(1:6 > 3, ">3", c("less", "than 3"))
## Error: `false` must be length 6 (length of `condition`) or one, not 2
if_else(1:6 > 3, ">3", 4)
## Error: `false` must be a character vector, not a double vector
if_else(1:6 > 3, ">3", NA)
## Error: `false` must be a character vector, not a logical vector
typeof(NA)
## [1] "logical"
if_else(1:6 > 3, ">3", NA_character_)
## [1] NA NA NA ">3" ">3" ">3"
complete()
makes implicit missing values explicit by filling in missing combinations.
tibble(patkey = 1:3, visit_number = 1:3) %>% complete(patkey, visit_number)
## # A tibble: 9 x 2 ## patkey visit_number ## <int> <int> ## 1 1 1 ## 2 1 2 ## 3 1 3 ## 4 2 1 ## 5 2 2 ## 6 2 3 ## 7 3 1 ## 8 3 2 ## 9 3 3
Creating fake data is really helpful for creating a reproducible example (aka reprex)
npats <- 100 nvisits <- 10 create_visit_df <- function(visit_number) { tibble( patid = 1:npats, visitdate = lubridate::as_date(sample(0:1e4, 100)), blooddraw = rbinom(npats, 1, prob = .7), bloodrawwho_nurse = blooddraw*rbinom(npats, 1, prob = .7), bloodrawwho_other = blooddraw*(1-bloodrawwho_nurse) ) %>% rename_at(vars(-patid), ~paste0(., "_", visit_number)) } visit_wide <- map(1:nvisits, create_visit_df) %>% reduce(inner_join, by = "patid")
## # A tibble: 1,000 x 5 ## patid visitdate blooddraw bloodrawwho_nurse bloodrawwho_other ## <int> <date> <int> <int> <dbl> ## 1 1 1986-09-08 1 1 0 ## 2 1 1986-10-19 0 0 0 ## 3 1 1980-08-15 0 0 0 ## 4 1 1975-11-30 1 0 1 ## 5 1 1993-03-26 0 0 0 ## 6 1 1994-03-24 1 0 1 ## 7 1 1973-05-16 0 0 0 ## 8 1 1995-04-21 1 0 1 ## 9 1 1990-11-17 1 0 1 ## 10 1 1992-01-12 1 1 0 ## # … with 990 more rows
In computer science, a list or sequence is an abstract data type that represents a countable number of ordered values, where the same value may occur more than once.
https://en.wikipedia.org/wiki/List_(abstract_data_type)
Basically it is just a container.
random_stuff <- list(mtcars, qplot(mpg, cyl, data = mtcars), function(x) x^2, list("a", "b", "c")) str(random_stuff, max.level = 1)
## List of 4 ## $ :'data.frame': 32 obs. of 11 variables: ## $ :List of 9 ## ..- attr(*, "class")= chr [1:2] "gg" "ggplot" ## $ :function (x) ## ..- attr(*, "srcref")= 'srcref' int [1:8] 4 22 4 36 22 36 4 4 ## .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x8026b08> ## $ :List of 3
More complex objects in R are often just lists.
list(1:3, 4:10, seq(0, 20, by = 2))
## [[1]] ## [1] 1 2 3 ## ## [[2]] ## [1] 4 5 6 7 8 9 10 ## ## [[3]] ## [1] 0 2 4 6 8 10 12 14 16 18 20
Put the output of the element in a new list
numeric_list <- list(1:3, 4:10, seq(0, 20, by = 2)) map(numeric_list, sum)
## [[1]] ## [1] 6 ## ## [[2]] ## [1] 49 ## ## [[3]] ## [1] 110
numeric_list <- list(1:3, 4:10, seq(0, 20, by = 2)) L2norm <- function(x) sqrt(sum(x^2)) map(numeric_list, L2norm)
## [[1]] ## [1] 3.741657 ## ## [[2]] ## [1] 19.26136 ## ## [[3]] ## [1] 39.24283
Also called an anonymous function since it has no name
numeric_list <- list(1:3, 4:10, seq(0, 20, by = 2)) map(numeric_list, function(x) sqrt(sum(x^2)))
## [[1]] ## [1] 3.741657 ## ## [[2]] ## [1] 19.26136 ## ## [[3]] ## [1] 39.24283
This shortcut only works in the tidyverse.
numeric_list <- list(1:3, 4:10, seq(0, 20, by = 2)) map(numeric_list, ~sqrt(sum(.^2)))
## [[1]] ## [1] 3.741657 ## ## [[2]] ## [1] 19.26136 ## ## [[3]] ## [1] 39.24283
Use a function that takes two arguments and returns one value to reduce a list.
+
is such a function
numeric_list <- list(1:3, 4:10, seq(0, 20, by = 2)) map(numeric_list, ~sqrt(sum(.^2))) %>% reduce(`+`)
## [1] 62.24585
Anything inside {} is evaluated as code
map(1:3, ~glue("_{.}")) %>% reduce(c)
## [1] "_1" "_2" "_3"
map_chr
returns a character vector.
There are many other variants of the map function.
map_chr(1:3, ~glue("_{.}"))
## [1] "_1" "_2" "_3"
We can easily select all variables for one visit
select(visit_wide, patid, ends_with("_1"))
## # A tibble: 100 x 5 ## patid visitdate_1 blooddraw_1 bloodrawwho_nurse_1 bloodrawwho_other_1 ## <int> <date> <int> <int> <dbl> ## 1 1 1986-09-08 1 1 0 ## 2 2 1973-07-28 0 0 0 ## 3 3 1980-10-20 1 1 0 ## 4 4 1997-03-23 1 1 0 ## 5 5 1970-03-08 1 1 0 ## 6 6 1991-11-23 0 0 0 ## 7 7 1973-02-05 0 0 0 ## 8 8 1983-04-28 1 0 1 ## 9 9 1981-03-06 1 1 0 ## 10 10 1982-12-08 1 1 0 ## # … with 90 more rows
df_list <- map(1:nvisits, ~select(visit_wide, patid, ends_with(glue("_{.}")))) str(df_list, max.level = 1)
## List of 10 ## $ :Classes 'tbl_df', 'tbl' and 'data.frame': 100 obs. of 5 variables: ## $ :Classes 'tbl_df', 'tbl' and 'data.frame': 100 obs. of 5 variables: ## $ :Classes 'tbl_df', 'tbl' and 'data.frame': 100 obs. of 5 variables: ## $ :Classes 'tbl_df', 'tbl' and 'data.frame': 100 obs. of 5 variables: ## $ :Classes 'tbl_df', 'tbl' and 'data.frame': 100 obs. of 5 variables: ## $ :Classes 'tbl_df', 'tbl' and 'data.frame': 100 obs. of 5 variables: ## $ :Classes 'tbl_df', 'tbl' and 'data.frame': 100 obs. of 5 variables: ## $ :Classes 'tbl_df', 'tbl' and 'data.frame': 100 obs. of 5 variables: ## $ :Classes 'tbl_df', 'tbl' and 'data.frame': 100 obs. of 5 variables: ## $ :Classes 'tbl_df', 'tbl' and 'data.frame': 100 obs. of 5 variables:
df_list[[1]]
## # A tibble: 100 x 5 ## patid visitdate_1 blooddraw_1 bloodrawwho_nurse_1 bloodrawwho_other_1 ## <int> <date> <int> <int> <dbl> ## 1 1 1986-09-08 1 1 0 ## 2 2 1973-07-28 0 0 0 ## 3 3 1980-10-20 1 1 0 ## 4 4 1997-03-23 1 1 0 ## 5 5 1970-03-08 1 1 0 ## 6 6 1991-11-23 0 0 0 ## 7 7 1973-02-05 0 0 0 ## 8 8 1983-04-28 1 0 1 ## 9 9 1981-03-06 1 1 0 ## 10 10 1982-12-08 1 1 0 ## # … with 90 more rows
df_list[[2]]
## # A tibble: 100 x 5 ## patid visitdate_2 blooddraw_2 bloodrawwho_nurse_2 bloodrawwho_other_2 ## <int> <date> <int> <int> <dbl> ## 1 1 1986-10-19 0 0 0 ## 2 2 1985-01-26 1 0 1 ## 3 3 1984-04-29 0 0 0 ## 4 4 1991-07-18 0 0 0 ## 5 5 1972-08-25 1 0 1 ## 6 6 1978-08-27 0 0 0 ## 7 7 1975-12-11 1 1 0 ## 8 8 1981-11-08 1 0 1 ## 9 9 1975-08-18 1 0 1 ## 10 10 1974-08-14 0 0 0 ## # … with 90 more rows
df_list2 <- map(df_list, ~rename_all(., ~str_remove(., "_[:digit:]+$"))) map(df_list2, names)
## [[1]] ## [1] "patid" "visitdate" "blooddraw" ## [4] "bloodrawwho_nurse" "bloodrawwho_other" ## ## [[2]] ## [1] "patid" "visitdate" "blooddraw" ## [4] "bloodrawwho_nurse" "bloodrawwho_other" ## ## [[3]] ## [1] "patid" "visitdate" "blooddraw" ## [4] "bloodrawwho_nurse" "bloodrawwho_other" ## ## [[4]] ## [1] "patid" "visitdate" "blooddraw" ## [4] "bloodrawwho_nurse" "bloodrawwho_other" ## ## [[5]] ## [1] "patid" "visitdate" "blooddraw" ## [4] "bloodrawwho_nurse" "bloodrawwho_other" ## ## [[6]] ## [1] "patid" "visitdate" "blooddraw" ## [4] "bloodrawwho_nurse" "bloodrawwho_other" ## ## [[7]] ## [1] "patid" "visitdate" "blooddraw" ## [4] "bloodrawwho_nurse" "bloodrawwho_other" ## ## [[8]] ## [1] "patid" "visitdate" "blooddraw" ## [4] "bloodrawwho_nurse" "bloodrawwho_other" ## ## [[9]] ## [1] "patid" "visitdate" "blooddraw" ## [4] "bloodrawwho_nurse" "bloodrawwho_other" ## ## [[10]] ## [1] "patid" "visitdate" "blooddraw" ## [4] "bloodrawwho_nurse" "bloodrawwho_other"
bind_rows takes two dataframes and rowbinds them reducing the list with bindrows will iteratively apply bind_rows so we end up with one dataframe
reduce(df_list2, bind_rows)
## # A tibble: 1,000 x 5 ## patid visitdate blooddraw bloodrawwho_nurse bloodrawwho_other ## <int> <date> <int> <int> <dbl> ## 1 1 1986-09-08 1 1 0 ## 2 2 1973-07-28 0 0 0 ## 3 3 1980-10-20 1 1 0 ## 4 4 1997-03-23 1 1 0 ## 5 5 1970-03-08 1 1 0 ## 6 6 1991-11-23 0 0 0 ## 7 7 1973-02-05 0 0 0 ## 8 8 1983-04-28 1 0 1 ## 9 9 1981-03-06 1 1 0 ## 10 10 1982-12-08 1 1 0 ## # … with 990 more rows
map(1:nvisits, ~select(visit_wide, patid, ends_with(glue("_{.}")))) %>% map(~rename_all(., ~str_remove(., "_[:digit:]+$"))) %>% reduce(bind_rows)
## # A tibble: 1,000 x 5 ## patid visitdate blooddraw bloodrawwho_nurse bloodrawwho_other ## <int> <date> <int> <int> <dbl> ## 1 1 1986-09-08 1 1 0 ## 2 2 1973-07-28 0 0 0 ## 3 3 1980-10-20 1 1 0 ## 4 4 1997-03-23 1 1 0 ## 5 5 1970-03-08 1 1 0 ## 6 6 1991-11-23 0 0 0 ## 7 7 1973-02-05 0 0 0 ## 8 8 1983-04-28 1 0 1 ## 9 9 1981-03-06 1 1 0 ## 10 10 1982-12-08 1 1 0 ## # … with 990 more rows
map(1:nvisits, ~select(visit_wide, patid, ends_with(glue("_{.}")))) %>% map(~rename_all(., ~str_remove(., "_[:digit:]+$"))) %>% bind_rows()
## # A tibble: 1,000 x 5 ## patid visitdate blooddraw bloodrawwho_nurse bloodrawwho_other ## <int> <date> <int> <int> <dbl> ## 1 1 1986-09-08 1 1 0 ## 2 2 1973-07-28 0 0 0 ## 3 3 1980-10-20 1 1 0 ## 4 4 1997-03-23 1 1 0 ## 5 5 1970-03-08 1 1 0 ## 6 6 1991-11-23 0 0 0 ## 7 7 1973-02-05 0 0 0 ## 8 8 1983-04-28 1 0 1 ## 9 9 1981-03-06 1 1 0 ## 10 10 1982-12-08 1 1 0 ## # … with 990 more rows
map(1:nvisits, ~select(visit_wide, patid, ends_with(glue("_{.}")))) %>% map_dfr(~rename_all(., ~str_remove(., "_[:digit:]+$")))
## # A tibble: 1,000 x 5 ## patid visitdate blooddraw bloodrawwho_nurse bloodrawwho_other ## <int> <date> <int> <int> <dbl> ## 1 1 1986-09-08 1 1 0 ## 2 2 1973-07-28 0 0 0 ## 3 3 1980-10-20 1 1 0 ## 4 4 1997-03-23 1 1 0 ## 5 5 1970-03-08 1 1 0 ## 6 6 1991-11-23 0 0 0 ## 7 7 1973-02-05 0 0 0 ## 8 8 1983-04-28 1 0 1 ## 9 9 1981-03-06 1 1 0 ## 10 10 1982-12-08 1 1 0 ## # … with 990 more rows
map2 is similar to map but loops over two arguments .x and .y in parallel.
In this case we are looping over our datafame list and the numbers 1 to 10.
map(1:nvisits, ~select(visit_wide, patid, ends_with(glue("_{.}")))) %>% map2(1:nvisits, ~mutate(.x, visit_number = .y)) %>% map_dfr(~rename_all(., ~str_remove(., "_[:digit:]+$")))
## # A tibble: 1,000 x 6 ## patid visitdate blooddraw bloodrawwho_nur… bloodrawwho_oth… ## <int> <date> <int> <int> <dbl> ## 1 1 1986-09-08 1 1 0 ## 2 2 1973-07-28 0 0 0 ## 3 3 1980-10-20 1 1 0 ## 4 4 1997-03-23 1 1 0 ## 5 5 1970-03-08 1 1 0 ## 6 6 1991-11-23 0 0 0 ## 7 7 1973-02-05 0 0 0 ## 8 8 1983-04-28 1 0 1 ## 9 9 1981-03-06 1 1 0 ## 10 10 1982-12-08 1 1 0 ## # … with 990 more rows, and 1 more variable: visit_number <int>
Create a SQLite database in memory and load our visit data into it
library(DBI) con <- dbConnect(RSQLite::SQLite(), ":memory:") dbWriteTable(con, "visit_wide", visit_wide) visit_db <- tbl(con, "visit_wide")
We just change bind_rows to union_all
map(1:nvisits, ~select(visit_db, patid, ends_with(glue("_{.}")))) %>% map(~rename_all(., ~str_remove(., "_[:digit:]+$"))) %>% reduce(union_all)
## # Source: lazy query [?? x 5] ## # Database: sqlite 3.22.0 [:memory:] ## patid visitdate blooddraw bloodrawwho_nurse bloodrawwho_other ## <int> <dbl> <int> <int> <dbl> ## 1 1 6094 1 1 0 ## 2 2 1304 0 0 0 ## 3 3 3945 1 1 0 ## 4 4 9943 1 1 0 ## 5 5 66 1 1 0 ## 6 6 7996 0 0 0 ## 7 7 1131 0 0 0 ## 8 8 4865 1 0 1 ## 9 9 4082 1 1 0 ## 10 10 4724 1 1 0 ## # … with more rows
map(1:nvisits, ~select(visit_db, patid, ends_with(glue("_{.}")))) %>% map(~rename_all(., ~str_remove(., "_[:digit:]+$"))) %>% reduce(union_all) %>% show_query()
## <SQL> ## SELECT `patid`, `visitdate_1` AS `visitdate`, `blooddraw_1` AS `blooddraw`, `bloodrawwho_nurse_1` AS `bloodrawwho_nurse`, `bloodrawwho_other_1` AS `bloodrawwho_other` ## FROM `visit_wide` ## UNION ALL ## SELECT `patid`, `visitdate_2` AS `visitdate`, `blooddraw_2` AS `blooddraw`, `bloodrawwho_nurse_2` AS `bloodrawwho_nurse`, `bloodrawwho_other_2` AS `bloodrawwho_other` ## FROM `visit_wide` ## UNION ALL ## SELECT `patid`, `visitdate_3` AS `visitdate`, `blooddraw_3` AS `blooddraw`, `bloodrawwho_nurse_3` AS `bloodrawwho_nurse`, `bloodrawwho_other_3` AS `bloodrawwho_other` ## FROM `visit_wide` ## UNION ALL ## SELECT `patid`, `visitdate_4` AS `visitdate`, `blooddraw_4` AS `blooddraw`, `bloodrawwho_nurse_4` AS `bloodrawwho_nurse`, `bloodrawwho_other_4` AS `bloodrawwho_other` ## FROM `visit_wide` ## UNION ALL ## SELECT `patid`, `visitdate_5` AS `visitdate`, `blooddraw_5` AS `blooddraw`, `bloodrawwho_nurse_5` AS `bloodrawwho_nurse`, `bloodrawwho_other_5` AS `bloodrawwho_other` ## FROM `visit_wide` ## UNION ALL ## SELECT `patid`, `visitdate_6` AS `visitdate`, `blooddraw_6` AS `blooddraw`, `bloodrawwho_nurse_6` AS `bloodrawwho_nurse`, `bloodrawwho_other_6` AS `bloodrawwho_other` ## FROM `visit_wide` ## UNION ALL ## SELECT `patid`, `visitdate_7` AS `visitdate`, `blooddraw_7` AS `blooddraw`, `bloodrawwho_nurse_7` AS `bloodrawwho_nurse`, `bloodrawwho_other_7` AS `bloodrawwho_other` ## FROM `visit_wide` ## UNION ALL ## SELECT `patid`, `visitdate_8` AS `visitdate`, `blooddraw_8` AS `blooddraw`, `bloodrawwho_nurse_8` AS `bloodrawwho_nurse`, `bloodrawwho_other_8` AS `bloodrawwho_other` ## FROM `visit_wide` ## UNION ALL ## SELECT `patid`, `visitdate_9` AS `visitdate`, `blooddraw_9` AS `blooddraw`, `bloodrawwho_nurse_9` AS `bloodrawwho_nurse`, `bloodrawwho_other_9` AS `bloodrawwho_other` ## FROM `visit_wide` ## UNION ALL ## SELECT `patid`, `visitdate_10` AS `visitdate`, `blooddraw_10` AS `blooddraw`, `bloodrawwho_nurse_10` AS `bloodrawwho_nurse`, `bloodrawwho_other_10` AS `bloodrawwho_other` ## FROM `visit_wide`
The map and reduce functions allow us to program on a higher level of abstraction. As the number of visits increases SQL requires more code while the R code does not need to change. map
and reduce
basically just write for
loops for you.
We can think of the scoped variants of dplyr in much the same way.
mutate_if(); mutate_at(); mutate_all() select_if(); select_at(); select_all() filter_if(); filter_at(); filter_all() rename_if(); rename_at(); rename_all() summarise_if(); summarise_at(); summarise_all()
These functions map over a dataframe which is a list of columns.
lst(min, mean, median, max, sd) %>% map_dfr(~map(mtcars, .), .id = "summary")
## # A tibble: 5 x 12 ## summary mpg cyl disp hp drat wt qsec vs am gear carb ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 min 10.4 4 71.1 52 2.76 1.51 14.5 0 0 3 1 ## 2 mean 20.1 6.19 231. 147. 3.60 3.22 17.8 0.438 0.406 3.69 2.81 ## 3 median 19.2 6 196. 123 3.70 3.32 17.7 0 0 4 2 ## 4 max 33.9 8 472 335 4.93 5.42 22.9 1 1 5 8 ## 5 sd 6.03 1.79 124. 68.6 0.535 0.978 1.79 0.504 0.499 0.738 1.62
This works because a dataframe is a list. It is a list of columns.
is.list(mtcars)
## [1] TRUE
Since list can contain other lists we can put anything inside a dataframe as a list column.
Here we split the mtcars dataframe by the cyl variable and fit a linear model to each subset. Then we pull out the \(R^2\) value from each model.
model_df <- mtcars %>% nest(data = -cyl) %>% mutate(model = map(data, ~lm(mpg ~ ., data = .))) %>% mutate(r_squared = map_dbl(model, ~summary(.)$r.squared)) model_df
## # A tibble: 3 x 4 ## cyl data model r_squared ## <dbl> <list<df[,10]>> <list> <dbl> ## 1 6 [7 × 10] <lm> 1 ## 2 4 [11 × 10] <lm> 0.928 ## 3 8 [14 × 10] <lm> 0.675
We can permanently save all our models and data in a single object.
write_rds(model_df, "model_df.rds")
The dataframe then becomes a more general data structure to store like objects together on a single row. Not every problem or analysis should be fit into this framework but it is a powerful concept that applies to many problems analysts face.
I don’t think so since pivot_longer
and gather
take many columns and collapse then into one column. Here we have columns of different types that can’t easily be put into a single column. I could be wrong though!
visit_wide %>% pivot_longer(cols = -patid)
## Error: No common type for `visitdate_1` <date> and `blooddraw_1` <integer>.