In this tutorial, we will learn how to split a dataframe into a list of dataframes by group in R. We will first use the base R function split() to divide a dataframe into a list of smaller dataframes, one per group. Then we will use dplyr’s group_split() function to do the same.
To get started, we first load tidyverse, a suite of R packages, and the palmerpenguins package, which provides the penguins data.
library(tidyverse)
# check the version of the loaded dplyr package
packageVersion("dplyr")
## [1] '1.0.8'
library(palmerpenguins)
How to Split a Dataframe into a list of Dataframes by groups using split() in base R
The split() function in base R divides the data in a vector or a dataframe into a list of groups. Here we show how to split a dataframe by group, using the species column as the grouping variable.
list_of_dataframes_by_split <- split(penguins, penguins$species)
Looking at the structure of the object returned by split(), we can see that it is a list of 3 elements, each of which is a dataframe.
str(list_of_dataframes_by_split)
## List of 3
##  $ Adelie   : tibble [152 × 8] (S3: tbl_df/tbl/data.frame)
##   ..$ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##   ..$ bill_length_mm   : num [1:152] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##   ..$ bill_depth_mm    : num [1:152] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##   ..$ flipper_length_mm: int [1:152] 181 186 195 NA 193 190 181 195 193 190 ...
##   ..$ body_mass_g      : int [1:152] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##   ..$ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
##   ..$ year             : int [1:152] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
##  $ Chinstrap: tibble [68 × 8] (S3: tbl_df/tbl/data.frame)
##   ..$ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 2 2 2 2 2 2 2 2 2 2 ...
##   ..$ island           : Factor w/ 3 levels "Biscoe","Dream",..: 2 2 2 2 2 2 2 2 2 2 ...
##   ..$ bill_length_mm   : num [1:68] 46.5 50 51.3 45.4 52.7 45.2 46.1 51.3 46 51.3 ...
##   ..$ bill_depth_mm    : num [1:68] 17.9 19.5 19.2 18.7 19.8 17.8 18.2 18.2 18.9 19.9 ...
##   ..$ flipper_length_mm: int [1:68] 192 196 193 188 197 198 178 197 195 198 ...
##   ..$ body_mass_g      : int [1:68] 3500 3900 3650 3525 3725 3950 3250 3750 4150 3700 ...
##   ..$ sex              : Factor w/ 2 levels "female","male": 1 2 2 1 2 1 1 2 1 2 ...
##   ..$ year             : int [1:68] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
##  $ Gentoo   : tibble [124 × 8] (S3: tbl_df/tbl/data.frame)
##   ..$ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 3 3 3 3 3 3 3 3 3 3 ...
##   ..$ island           : Factor w/ 3 levels "Biscoe","Dream",..: 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ bill_length_mm   : num [1:124] 46.1 50 48.7 50 47.6 46.5 45.4 46.7 43.3 46.8 ...
##   ..$ bill_depth_mm    : num [1:124] 13.2 16.3 14.1 15.2 14.5 13.5 14.6 15.3 13.4 15.4 ...
##   ..$ flipper_length_mm: int [1:124] 211 230 210 218 215 210 211 219 209 215 ...
##   ..$ body_mass_g      : int [1:124] 4500 5700 4450 5700 5400 4550 4800 5200 4400 5150 ...
##   ..$ sex              : Factor w/ 2 levels "female","male": 1 2 1 2 2 1 1 2 1 2 ...
##   ..$ year             : int [1:124] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
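Because split() names each list element after a level of the grouping factor, the pieces can be looked up by name. The short sketch below illustrates this; the variable names adelie and rows_per_species are only illustrative and are not part of the original example.

```r
library(palmerpenguins)

# Split the penguins data by species, as in the example above
list_of_dataframes_by_split <- split(penguins, penguins$species)

# The list elements are named after the levels of the grouping factor
names(list_of_dataframes_by_split)

# Pull out the dataframe for a single group by its name
adelie <- list_of_dataframes_by_split[["Adelie"]]

# Count the rows in each group; the result keeps the group names
rows_per_species <- sapply(list_of_dataframes_by_split, nrow)
rows_per_species
```

The row counts should match the tibble dimensions reported by str() above, and together they should add up to the number of rows in penguins.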
How to Split a Dataframe into a list of Dataframes by groups using group_split() in dplyr
dplyr has an experimental function, group_split(), that behaves very much like base R’s split() function. We can use group_split() in two ways. In the example below, we pass the dataframe and the grouping variable to group_split() to get a list of smaller dataframes.
penguins %>% group_split(species)
## <list_of<
##   tbl_df<
##     species          : factor<b22a0>
##     island           : factor<ccf33>
##     bill_length_mm   : double
##     bill_depth_mm    : double
##     flipper_length_mm: integer
##     body_mass_g      : integer
##     sex              : factor<8f119>
##     year             : integer
##   >
## >[3]>
## [[1]]
## # A tibble: 152 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # … with 142 more rows, and 2 more variables: sex <fct>, year <int>
##
## [[2]]
## # A tibble: 68 × 8
##    species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
##  1 Chinstrap Dream            46.5          17.9               192        3500
##  2 Chinstrap Dream            50            19.5               196        3900
##  3 Chinstrap Dream            51.3          19.2               193        3650
##  4 Chinstrap Dream            45.4          18.7               188        3525
##  5 Chinstrap Dream            52.7          19.8               197        3725
##  6 Chinstrap Dream            45.2          17.8               198        3950
##  7 Chinstrap Dream            46.1          18.2               178        3250
##  8 Chinstrap Dream            51.3          18.2               197        3750
##  9 Chinstrap Dream            46            18.9               195        4150
## 10 Chinstrap Dream            51.3          19.9               198        3700
## # … with 58 more rows, and 2 more variables: sex <fct>, year <int>
##
## [[3]]
## # A tibble: 124 × 8
##    species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
##  1 Gentoo  Biscoe           46.1          13.2               211        4500
##  2 Gentoo  Biscoe           50            16.3               230        5700
##  3 Gentoo  Biscoe           48.7          14.1               210        4450
##  4 Gentoo  Biscoe           50            15.2               218        5700
##  5 Gentoo  Biscoe           47.6          14.5               215        5400
##  6 Gentoo  Biscoe           46.5          13.5               210        4550
##  7 Gentoo  Biscoe           45.4          14.6               211        4800
##  8 Gentoo  Biscoe           46.7          15.3               219        5200
##  9 Gentoo  Biscoe           43.3          13.4               209        4400
## 10 Gentoo  Biscoe           46.8          15.4               215        5150
## # … with 114 more rows, and 2 more variables: sex <fct>, year <int>
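Unlike split(), group_split() returns a list whose elements are indexed by position ([[1]], [[2]], [[3]]) rather than by group name. If names are wanted, one possible approach is to take them from group_keys(), which returns one row per group. The sketch below assumes the groups come back in the same order from both functions; the names list_by_group_split, keys and named_list are illustrative only.

```r
library(dplyr)
library(palmerpenguins)

# Split by species; the resulting list has no element names
list_by_group_split <- penguins %>% group_split(species)
names(list_by_group_split)

# group_keys() gives one row per group, here a single species column
keys <- penguins %>%
  group_by(species) %>%
  group_keys()

# Attach the species labels as list names
named_list <- setNames(list_by_group_split, as.character(keys$species))
names(named_list)
```

With a single grouping variable this gives a list comparable to the one from split(). With several grouping variables the keys would have to be combined into one label first.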
dplyr’s group_split() function also works on a grouped object, i.e. the result of dplyr’s group_by() function. For example, here we create a grouped object by applying group_by() to the dataframe.
grp_obj <- penguins %>% group_by(species)
Then we can split it into a list of dataframes using group_split(), as shown here, and we get the same result as before.
grp_obj %>% group_split()
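A light way to check that the approaches agree is to compare the number of rows in each piece. This is only a rough sanity check under the assumption that all three approaches order the groups by the levels of species; it does not compare the contents of the dataframes. The variable names below are illustrative.

```r
library(dplyr)
library(palmerpenguins)

# The two ways of calling group_split()
direct <- penguins %>% group_split(species)
via_group_by <- penguins %>% group_by(species) %>% group_split()

# Row counts per group for each approach
rows_direct <- sapply(direct, nrow)
rows_via_group_by <- sapply(via_group_by, nrow)

# Row counts from base R split(), with the names dropped for comparison
rows_split <- unname(sapply(split(penguins, penguins$species), nrow))

identical(rows_direct, rows_via_group_by)
identical(rows_direct, rows_split)
```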