The tidyverse and functional programming in R

Load a couple libraries

library(tidyverse) 
library(glue) # similar to paste() but more powerful

All of the ideas in this talk come from other people, primarily the R for Data Science iteration chapter.

https://r4ds.had.co.nz/iteration.html

R lets you create tools that work the way you want them to

For example you can define a new ifelse function that treats NA like any other value.

ifelse2 <- function(test, yes, no) ifelse(!is.na(test) & test, yes, no)

tibble(age = c(0, NA, 5, 15, 21, 35)) %>% 
  mutate(ifelse  =  ifelse(age < 18, "Minor", "Adult"),  # Base R version
         ifelse2 = ifelse2(age < 18, "Minor", "Adult"),  # My custom version
         if_else = if_else(age < 18, "Minor", "Adult"))  # Tidyverse version

## # A tibble: 6 x 4
##     age ifelse ifelse2 if_else
##   <dbl> <chr>  <chr>   <chr>  
## 1     0 Minor  Minor   Minor  
## 2    NA <NA>   Adult   <NA>   
## 3     5 Minor  Minor   Minor  
## 4    15 Minor  Minor   Minor  
## 5    21 Adult  Adult   Adult  
## 6    35 Adult  Adult   Adult

The tidyverse developers have re-implemented a lot of base R functionality

if_else has an argument for explicit NA handling

tibble(age = c(0, NA, 5, 15, 21, 35)) %>% 
  mutate(ifelse  =  ifelse(age < 18, "Minor", "Adult"),
         ifelse2 = ifelse2(age < 18, "Minor", "Adult"),  
         if_else = if_else(age < 18, "Minor", "Adult", missing = "We don't know"))

## # A tibble: 6 x 4
##     age ifelse ifelse2 if_else      
##   <dbl> <chr>  <chr>   <chr>        
## 1     0 Minor  Minor   Minor        
## 2    NA <NA>   Adult   We don't know
## 3     5 Minor  Minor   Minor        
## 4    15 Minor  Minor   Minor        
## 5    21 Adult  Adult   Adult        
## 6    35 Adult  Adult   Adult

Strangely, case_when treats NA like any other value

tibble(age = c(0, NA, 5, 15, 21, 35)) %>% 
  mutate(case_when = case_when(
    age < 1 ~ "Baby",     
    age < 18 ~ "Minor",
    T ~ "Adult"))

## # A tibble: 6 x 2
##     age case_when
##   <dbl> <chr>    
## 1     0 Baby     
## 2    NA Adult    
## 3     5 Minor    
## 4    15 Minor    
## 5    21 Adult    
## 6    35 Adult

tibble(age = c(0, NA, 5, 15, 21, 35)) %>% 
  mutate(case_when = case_when(
    age < 1 ~ "Baby",     
    age < 18 ~ "Minor",
    is.na(age) ~ "We don't know",
    T ~ "Adult"))

## # A tibble: 6 x 2
##     age case_when    
##   <dbl> <chr>        
## 1     0 Baby         
## 2    NA We don't know
## 3     5 Minor        
## 4    15 Minor        
## 5    21 Adult        
## 6    35 Adult

The tidyverse functions tend to be more strict than base R

ifelse(1:6 > 3, ">3", c("less", "than 3"))

## [1] "less"   "than 3" "less"   ">3"     ">3"     ">3"

ifelse(1:6 > 3, ">3", 4)

## [1] "4"  "4"  "4"  ">3" ">3" ">3"

if_else(1:6 > 3, ">3", c("less", "than 3"))

## Error: `false` must be length 6 (length of `condition`) or one, not 2

if_else(1:6 > 3, ">3", 4)

## Error: `false` must be a character vector, not a double vector

However this strictness has consequences

if_else(1:6 > 3, ">3", NA)

## Error: `false` must be a character vector, not a logical vector

typeof(NA)

## [1] "logical"

if_else(1:6 > 3, ">3", NA_character_)

## [1] NA   NA   NA   ">3" ">3" ">3"

There are lots of convenience functions in the tidyverse

complete() makes implicit missing values explicit by filling in missing combinations.

tibble(patkey = 1:3, visit_number = 1:3) %>% 
  complete(patkey, visit_number)

## # A tibble: 9 x 2
##   patkey visit_number
##    <int>        <int>
## 1      1            1
## 2      1            2
## 3      1            3
## 4      2            1
## 5      2            2
## 6      2            3
## 7      3            1
## 8      3            2
## 9      3            3

Create some fake visit data

Creating fake data is really helpful for creating a reproducible example (aka reprex)

npats <- 100
nvisits <- 10
create_visit_df <- function(visit_number) {
  tibble(
    patid  = 1:npats,
    visitdate = lubridate::as_date(sample(0:1e4, 100)),
    blooddraw = rbinom(npats, 1, prob = .7),
    bloodrawwho_nurse = blooddraw*rbinom(npats, 1, prob = .7),
    bloodrawwho_other = blooddraw*(1-bloodrawwho_nurse)
  ) %>% 
    rename_at(vars(-patid), ~paste0(., "_", visit_number))
}

visit_wide <- map(1:nvisits, create_visit_df) %>% 
  reduce(inner_join, by = "patid")

One row per patient with grouped columns

The goal: One row per visit

## # A tibble: 1,000 x 5
##    patid visitdate  blooddraw bloodrawwho_nurse bloodrawwho_other
##    <int> <date>         <int>             <int>             <dbl>
##  1     1 1986-09-08         1                 1                 0
##  2     1 1986-10-19         0                 0                 0
##  3     1 1980-08-15         0                 0                 0
##  4     1 1975-11-30         1                 0                 1
##  5     1 1993-03-26         0                 0                 0
##  6     1 1994-03-24         1                 0                 1
##  7     1 1973-05-16         0                 0                 0
##  8     1 1995-04-21         1                 0                 1
##  9     1 1990-11-17         1                 0                 1
## 10     1 1992-01-12         1                 1                 0
## # … with 990 more rows

Visually…

Lists

In computer science, a list or sequence is an abstract data type that represents a countable number of ordered values, where the same value may occur more than once.

https://en.wikipedia.org/wiki/List_(abstract_data_type)

Basically it is just a container.

Lists can contain anything

random_stuff <- list(mtcars,
                     qplot(mpg, cyl, data = mtcars), 
                     function(x) x^2, 
                     list("a", "b", "c"))
str(random_stuff, max.level = 1)

## List of 4
##  $ :'data.frame':    32 obs. of  11 variables:
##  $ :List of 9
##   ..- attr(*, "class")= chr [1:2] "gg" "ggplot"
##  $ :function (x)  
##   ..- attr(*, "srcref")= 'srcref' int [1:8] 4 22 4 36 22 36 4 4
##   .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x8026b08> 
##  $ :List of 3

More complex objects in R are often just lists.

Lists that contain objects of one type are especially useful

list(1:3, 4:10, seq(0, 20, by = 2))

## [[1]]
## [1] 1 2 3
## 
## [[2]]
## [1]  4  5  6  7  8  9 10
## 
## [[3]]
##  [1]  0  2  4  6  8 10 12 14 16 18 20

Apply a function to each element of a list

Put the output of the element in a new list

numeric_list <- list(1:3, 4:10, seq(0, 20, by = 2))

map(numeric_list, sum)

## [[1]]
## [1] 6
## 
## [[2]]
## [1] 49
## 
## [[3]]
## [1] 110

Apply a custom function

numeric_list <- list(1:3, 4:10, seq(0, 20, by = 2))

L2norm <- function(x) sqrt(sum(x^2))

map(numeric_list, L2norm)

## [[1]]
## [1] 3.741657
## 
## [[2]]
## [1] 19.26136
## 
## [[3]]
## [1] 39.24283

Create a custom function on the fly

Also called an anonymous function since it has no name

numeric_list <- list(1:3, 4:10, seq(0, 20, by = 2))

map(numeric_list, function(x) sqrt(sum(x^2)))

## [[1]]
## [1] 3.741657
## 
## [[2]]
## [1] 19.26136
## 
## [[3]]
## [1] 39.24283

Use the ~ . shortcut to create a custom function on the fly

This shortcut only works in the tidyverse.

numeric_list <- list(1:3, 4:10, seq(0, 20, by = 2))

map(numeric_list, ~sqrt(sum(.^2)))

## [[1]]
## [1] 3.741657
## 
## [[2]]
## [1] 19.26136
## 
## [[3]]
## [1] 39.24283

Reduce a list to a single value

Use a function that takes two arguments and returns one value to reduce a list.

+ is such a function

numeric_list <- list(1:3, 4:10, seq(0, 20, by = 2))

map(numeric_list, ~sqrt(sum(.^2))) %>% 
  reduce(`+`)

## [1] 62.24585

Another example using glue for string concatenation

Anything inside {} is evaluated as code

map(1:3, ~glue("_{.}")) %>% 
  reduce(c)

## [1] "_1" "_2" "_3"

Variants of map will do the reduce step for us

map_chr returns a character vector.

There are many other variants of the map function.

map_chr(1:3, ~glue("_{.}"))

## [1] "_1" "_2" "_3"

And just for the fun of it…

Back to the problem at hand

We can easily select all variables for one visit

select(visit_wide, patid, ends_with("_1"))

## # A tibble: 100 x 5
##    patid visitdate_1 blooddraw_1 bloodrawwho_nurse_1 bloodrawwho_other_1
##    <int> <date>            <int>               <int>               <dbl>
##  1     1 1986-09-08            1                   1                   0
##  2     2 1973-07-28            0                   0                   0
##  3     3 1980-10-20            1                   1                   0
##  4     4 1997-03-23            1                   1                   0
##  5     5 1970-03-08            1                   1                   0
##  6     6 1991-11-23            0                   0                   0
##  7     7 1973-02-05            0                   0                   0
##  8     8 1983-04-28            1                   0                   1
##  9     9 1981-03-06            1                   1                   0
## 10    10 1982-12-08            1                   1                   0
## # … with 90 more rows

Split the dataframe into a list of dataframes

df_list <- map(1:nvisits, ~select(visit_wide, patid, ends_with(glue("_{.}"))))

str(df_list, max.level = 1)

## List of 10
##  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    100 obs. of  5 variables:
##  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    100 obs. of  5 variables:
##  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    100 obs. of  5 variables:
##  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    100 obs. of  5 variables:
##  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    100 obs. of  5 variables:
##  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    100 obs. of  5 variables:
##  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    100 obs. of  5 variables:
##  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    100 obs. of  5 variables:
##  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    100 obs. of  5 variables:
##  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    100 obs. of  5 variables:

The first visit

df_list[[1]]

## # A tibble: 100 x 5
##    patid visitdate_1 blooddraw_1 bloodrawwho_nurse_1 bloodrawwho_other_1
##    <int> <date>            <int>               <int>               <dbl>
##  1     1 1986-09-08            1                   1                   0
##  2     2 1973-07-28            0                   0                   0
##  3     3 1980-10-20            1                   1                   0
##  4     4 1997-03-23            1                   1                   0
##  5     5 1970-03-08            1                   1                   0
##  6     6 1991-11-23            0                   0                   0
##  7     7 1973-02-05            0                   0                   0
##  8     8 1983-04-28            1                   0                   1
##  9     9 1981-03-06            1                   1                   0
## 10    10 1982-12-08            1                   1                   0
## # … with 90 more rows

The second visit

df_list[[2]]

## # A tibble: 100 x 5
##    patid visitdate_2 blooddraw_2 bloodrawwho_nurse_2 bloodrawwho_other_2
##    <int> <date>            <int>               <int>               <dbl>
##  1     1 1986-10-19            0                   0                   0
##  2     2 1985-01-26            1                   0                   1
##  3     3 1984-04-29            0                   0                   0
##  4     4 1991-07-18            0                   0                   0
##  5     5 1972-08-25            1                   0                   1
##  6     6 1978-08-27            0                   0                   0
##  7     7 1975-12-11            1                   1                   0
##  8     8 1981-11-08            1                   0                   1
##  9     9 1975-08-18            1                   0                   1
## 10    10 1974-08-14            0                   0                   0
## # … with 90 more rows

We need to rename the columns

df_list2 <- map(df_list, ~rename_all(., ~str_remove(., "_[:digit:]+$")))

map(df_list2, names)

## [[1]]
## [1] "patid"             "visitdate"         "blooddraw"        
## [4] "bloodrawwho_nurse" "bloodrawwho_other"
## 
## [[2]]
## [1] "patid"             "visitdate"         "blooddraw"        
## [4] "bloodrawwho_nurse" "bloodrawwho_other"
## 
## [[3]]
## [1] "patid"             "visitdate"         "blooddraw"        
## [4] "bloodrawwho_nurse" "bloodrawwho_other"
## 
## [[4]]
## [1] "patid"             "visitdate"         "blooddraw"        
## [4] "bloodrawwho_nurse" "bloodrawwho_other"
## 
## [[5]]
## [1] "patid"             "visitdate"         "blooddraw"        
## [4] "bloodrawwho_nurse" "bloodrawwho_other"
## 
## [[6]]
## [1] "patid"             "visitdate"         "blooddraw"        
## [4] "bloodrawwho_nurse" "bloodrawwho_other"
## 
## [[7]]
## [1] "patid"             "visitdate"         "blooddraw"        
## [4] "bloodrawwho_nurse" "bloodrawwho_other"
## 
## [[8]]
## [1] "patid"             "visitdate"         "blooddraw"        
## [4] "bloodrawwho_nurse" "bloodrawwho_other"
## 
## [[9]]
## [1] "patid"             "visitdate"         "blooddraw"        
## [4] "bloodrawwho_nurse" "bloodrawwho_other"
## 
## [[10]]
## [1] "patid"             "visitdate"         "blooddraw"        
## [4] "bloodrawwho_nurse" "bloodrawwho_other"

Finally we reduce using bind_rows

bind_rows takes two dataframes and rowbinds them reducing the list with bindrows will iteratively apply bind_rows so we end up with one dataframe

reduce(df_list2, bind_rows)

## # A tibble: 1,000 x 5
##    patid visitdate  blooddraw bloodrawwho_nurse bloodrawwho_other
##    <int> <date>         <int>             <int>             <dbl>
##  1     1 1986-09-08         1                 1                 0
##  2     2 1973-07-28         0                 0                 0
##  3     3 1980-10-20         1                 1                 0
##  4     4 1997-03-23         1                 1                 0
##  5     5 1970-03-08         1                 1                 0
##  6     6 1991-11-23         0                 0                 0
##  7     7 1973-02-05         0                 0                 0
##  8     8 1983-04-28         1                 0                 1
##  9     9 1981-03-06         1                 1                 0
## 10    10 1982-12-08         1                 1                 0
## # … with 990 more rows

Using the pipe %>% we can do all this three lines

map(1:nvisits, ~select(visit_wide, patid, ends_with(glue("_{.}")))) %>% 
  map(~rename_all(., ~str_remove(., "_[:digit:]+$"))) %>% 
  reduce(bind_rows)

## # A tibble: 1,000 x 5
##    patid visitdate  blooddraw bloodrawwho_nurse bloodrawwho_other
##    <int> <date>         <int>             <int>             <dbl>
##  1     1 1986-09-08         1                 1                 0
##  2     2 1973-07-28         0                 0                 0
##  3     3 1980-10-20         1                 1                 0
##  4     4 1997-03-23         1                 1                 0
##  5     5 1970-03-08         1                 1                 0
##  6     6 1991-11-23         0                 0                 0
##  7     7 1973-02-05         0                 0                 0
##  8     8 1983-04-28         1                 0                 1
##  9     9 1981-03-06         1                 1                 0
## 10    10 1982-12-08         1                 1                 0
## # … with 990 more rows

And since bind_rows will accept a list we don’t even need reduce in this example

map(1:nvisits, ~select(visit_wide, patid, ends_with(glue("_{.}")))) %>% 
  map(~rename_all(., ~str_remove(., "_[:digit:]+$"))) %>% 
  bind_rows()

## # A tibble: 1,000 x 5
##    patid visitdate  blooddraw bloodrawwho_nurse bloodrawwho_other
##    <int> <date>         <int>             <int>             <dbl>
##  1     1 1986-09-08         1                 1                 0
##  2     2 1973-07-28         0                 0                 0
##  3     3 1980-10-20         1                 1                 0
##  4     4 1997-03-23         1                 1                 0
##  5     5 1970-03-08         1                 1                 0
##  6     6 1991-11-23         0                 0                 0
##  7     7 1973-02-05         0                 0                 0
##  8     8 1983-04-28         1                 0                 1
##  9     9 1981-03-06         1                 1                 0
## 10    10 1982-12-08         1                 1                 0
## # … with 990 more rows

Using the map_dfr variant we can do it in two lines

map(1:nvisits, ~select(visit_wide, patid, ends_with(glue("_{.}")))) %>% 
  map_dfr(~rename_all(., ~str_remove(., "_[:digit:]+$")))

## # A tibble: 1,000 x 5
##    patid visitdate  blooddraw bloodrawwho_nurse bloodrawwho_other
##    <int> <date>         <int>             <int>             <dbl>
##  1     1 1986-09-08         1                 1                 0
##  2     2 1973-07-28         0                 0                 0
##  3     3 1980-10-20         1                 1                 0
##  4     4 1997-03-23         1                 1                 0
##  5     5 1970-03-08         1                 1                 0
##  6     6 1991-11-23         0                 0                 0
##  7     7 1973-02-05         0                 0                 0
##  8     8 1983-04-28         1                 0                 1
##  9     9 1981-03-06         1                 1                 0
## 10    10 1982-12-08         1                 1                 0
## # … with 990 more rows

Add a visit number column

map2 is similar to map but loops over two arguments .x and .y in parallel.

In this case we are looping over our datafame list and the numbers 1 to 10.

map(1:nvisits, ~select(visit_wide, patid, ends_with(glue("_{.}")))) %>% 
  map2(1:nvisits, ~mutate(.x, visit_number = .y)) %>% 
  map_dfr(~rename_all(., ~str_remove(., "_[:digit:]+$")))

## # A tibble: 1,000 x 6
##    patid visitdate  blooddraw bloodrawwho_nur… bloodrawwho_oth…
##    <int> <date>         <int>            <int>            <dbl>
##  1     1 1986-09-08         1                1                0
##  2     2 1973-07-28         0                0                0
##  3     3 1980-10-20         1                1                0
##  4     4 1997-03-23         1                1                0
##  5     5 1970-03-08         1                1                0
##  6     6 1991-11-23         0                0                0
##  7     7 1973-02-05         0                0                0
##  8     8 1983-04-28         1                0                1
##  9     9 1981-03-06         1                1                0
## 10    10 1982-12-08         1                1                0
## # … with 990 more rows, and 1 more variable: visit_number <int>

It even works with a database

Create a SQLite database in memory and load our visit data into it

library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "visit_wide", visit_wide)
visit_db <- tbl(con, "visit_wide")

R writes the SQL for us and sends it to the database

We just change bind_rows to union_all

map(1:nvisits, ~select(visit_db, patid, ends_with(glue("_{.}")))) %>% 
  map(~rename_all(., ~str_remove(., "_[:digit:]+$"))) %>% 
  reduce(union_all)

## # Source:   lazy query [?? x 5]
## # Database: sqlite 3.22.0 [:memory:]
##    patid visitdate blooddraw bloodrawwho_nurse bloodrawwho_other
##    <int>     <dbl>     <int>             <int>             <dbl>
##  1     1      6094         1                 1                 0
##  2     2      1304         0                 0                 0
##  3     3      3945         1                 1                 0
##  4     4      9943         1                 1                 0
##  5     5        66         1                 1                 0
##  6     6      7996         0                 0                 0
##  7     7      1131         0                 0                 0
##  8     8      4865         1                 0                 1
##  9     9      4082         1                 1                 0
## 10    10      4724         1                 1                 0
## # … with more rows

map saved us from having to write a lot of SQL

map(1:nvisits, ~select(visit_db, patid, ends_with(glue("_{.}")))) %>% 
  map(~rename_all(., ~str_remove(., "_[:digit:]+$"))) %>% 
  reduce(union_all) %>% 
  show_query()

## <SQL>
## SELECT `patid`, `visitdate_1` AS `visitdate`, `blooddraw_1` AS `blooddraw`, `bloodrawwho_nurse_1` AS `bloodrawwho_nurse`, `bloodrawwho_other_1` AS `bloodrawwho_other`
## FROM `visit_wide`
## UNION ALL
## SELECT `patid`, `visitdate_2` AS `visitdate`, `blooddraw_2` AS `blooddraw`, `bloodrawwho_nurse_2` AS `bloodrawwho_nurse`, `bloodrawwho_other_2` AS `bloodrawwho_other`
## FROM `visit_wide`
## UNION ALL
## SELECT `patid`, `visitdate_3` AS `visitdate`, `blooddraw_3` AS `blooddraw`, `bloodrawwho_nurse_3` AS `bloodrawwho_nurse`, `bloodrawwho_other_3` AS `bloodrawwho_other`
## FROM `visit_wide`
## UNION ALL
## SELECT `patid`, `visitdate_4` AS `visitdate`, `blooddraw_4` AS `blooddraw`, `bloodrawwho_nurse_4` AS `bloodrawwho_nurse`, `bloodrawwho_other_4` AS `bloodrawwho_other`
## FROM `visit_wide`
## UNION ALL
## SELECT `patid`, `visitdate_5` AS `visitdate`, `blooddraw_5` AS `blooddraw`, `bloodrawwho_nurse_5` AS `bloodrawwho_nurse`, `bloodrawwho_other_5` AS `bloodrawwho_other`
## FROM `visit_wide`
## UNION ALL
## SELECT `patid`, `visitdate_6` AS `visitdate`, `blooddraw_6` AS `blooddraw`, `bloodrawwho_nurse_6` AS `bloodrawwho_nurse`, `bloodrawwho_other_6` AS `bloodrawwho_other`
## FROM `visit_wide`
## UNION ALL
## SELECT `patid`, `visitdate_7` AS `visitdate`, `blooddraw_7` AS `blooddraw`, `bloodrawwho_nurse_7` AS `bloodrawwho_nurse`, `bloodrawwho_other_7` AS `bloodrawwho_other`
## FROM `visit_wide`
## UNION ALL
## SELECT `patid`, `visitdate_8` AS `visitdate`, `blooddraw_8` AS `blooddraw`, `bloodrawwho_nurse_8` AS `bloodrawwho_nurse`, `bloodrawwho_other_8` AS `bloodrawwho_other`
## FROM `visit_wide`
## UNION ALL
## SELECT `patid`, `visitdate_9` AS `visitdate`, `blooddraw_9` AS `blooddraw`, `bloodrawwho_nurse_9` AS `bloodrawwho_nurse`, `bloodrawwho_other_9` AS `bloodrawwho_other`
## FROM `visit_wide`
## UNION ALL
## SELECT `patid`, `visitdate_10` AS `visitdate`, `blooddraw_10` AS `blooddraw`, `bloodrawwho_nurse_10` AS `bloodrawwho_nurse`, `bloodrawwho_other_10` AS `bloodrawwho_other`
## FROM `visit_wide`

The main takeaway

The map and reduce functions allow us to program on a higher level of abstraction. As the number of visits increases SQL requires more code while the R code does not need to change. map and reduce basically just write for loops for you.

We can think of the scoped variants of dplyr in much the same way.

mutate_if(); mutate_at(); mutate_all()
select_if(); select_at(); select_all()
filter_if(); filter_at(); filter_all()
rename_if(); rename_at(); rename_all()
summarise_if(); summarise_at(); summarise_all()

These functions map over a dataframe which is a list of columns.

Extensions: Functions as data

lst(min, mean, median, max, sd) %>% 
  map_dfr(~map(mtcars, .), .id = "summary")

## # A tibble: 5 x 12
##   summary   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 min     10.4   4     71.1  52   2.76  1.51  14.5  0     0     3      1   
## 2 mean    20.1   6.19 231.  147.  3.60  3.22  17.8  0.438 0.406 3.69   2.81
## 3 median  19.2   6    196.  123   3.70  3.32  17.7  0     0     4      2   
## 4 max     33.9   8    472   335   4.93  5.42  22.9  1     1     5      8   
## 5 sd       6.03  1.79 124.   68.6 0.535 0.978  1.79 0.504 0.499 0.738  1.62

This works because a dataframe is a list. It is a list of columns.

is.list(mtcars)

## [1] TRUE

Extensions: List-columns in a dataframe

Since list can contain other lists we can put anything inside a dataframe as a list column.

Here we split the mtcars dataframe by the cyl variable and fit a linear model to each subset. Then we pull out the \(R^2\) value from each model.

model_df <- mtcars %>% 
  nest(data = -cyl) %>% 
  mutate(model = map(data, ~lm(mpg ~ ., data = .))) %>% 
  mutate(r_squared = map_dbl(model, ~summary(.)$r.squared))

model_df

## # A tibble: 3 x 4
##     cyl            data model  r_squared
##   <dbl> <list<df[,10]>> <list>     <dbl>
## 1     6        [7 × 10] <lm>       1    
## 2     4       [11 × 10] <lm>       0.928
## 3     8       [14 × 10] <lm>       0.675

Extensions: Many models idea from R4ds

We can permanently save all our models and data in a single object.

write_rds(model_df, "model_df.rds")

The dataframe then becomes a more general data structure to store like objects together on a single row. Not every problem or analysis should be fit into this framework but it is a powerful concept that applies to many problems analysts face.

https://r4ds.had.co.nz/many-models.html

Could this transposition have been done with pivot_longer() or gather()?

I don’t think so since pivot_longer and gather take many columns and collapse then into one column. Here we have columns of different types that can’t easily be put into a single column. I could be wrong though!

visit_wide %>% 
  pivot_longer(cols = -patid)

## Error: No common type for `visitdate_1` <date> and `blooddraw_1` <integer>.