In this tutorial, we will learn how to build many simple linear regression models in R using tidyverse. We will start with a simple approach that is not scalable to illustrate the challenges. Then we will show a naive approach using for loop to build many linear regression models. And finally we will show an elegant solution of building many linear regression models using map() function in purrr in tidyverse.
Let us load the packages needed . We will use penguins data to build many linear regression models with lm().
library(tidyverse) library(palmerpenguins) theme_set(theme_bw(16))
penguins %>% head() # A tibble: 6 × 8 species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g <fct> <fct> <dbl> <dbl> <int> <int> 1 Adelie Torgersen 39.1 18.7 181 3750 2 Adelie Torgersen 39.5 17.4 186 3800 3 Adelie Torgersen 40.3 18 195 3250 4 Adelie Torgersen NA NA NA NA 5 Adelie Torgersen 36.7 19.3 193 3450 6 Adelie Torgersen 39.3 20.6 190 3650 # ℹ 2 more variables: sex <fct>, year <int>
Let us say we are interested in building linear regression models between bill length and flipper length for all three penguin species in the data. Here we just visualize the linear association of these two quantitative variables.
penguins %>% ggplot(aes(bill_length_mm, flipper_length_mm))+ geom_point()+ facet_wrap(~species)
Naive approach to build many linear models
In our example we need to build three simple linear regression models. A naive approach is to build 3 linear regression models one by one manually. To do that, we might filter the orginal data to a subset we want to build one linear regression model as shown below. Here we get data for one species
Adelie_df <- penguins %>% filter(species=="Adelie")
And build linear regression model for one species using lm() function in R.
lm_Adelie = lm(bill_length_mm ~flipper_length_mm, data=Adelie_df)
We can then get the results from the linear regression model. Here we find that bill length is strongly associated with flipper length in Adelie penguins.
lm_Adelie %>% tidy() # A tibble: 2 × 5 term estimate std.error statistic p.value <chr> <dbl> <dbl> <dbl> <dbl> 1 (Intercept) 13.6 6.00 2.27 0.0249 2 flipper_length_mm 0.133 0.0315 4.21 0.0000447
With the naive approach, we will do the same but now for the second species.
Chinstrap_df <- penguins %>% filter(species=="Chinstrap") lm_Chinstrap = lm(bill_length_mm ~flipper_length_mm, data=Chinstrap_df) lm_Chinstrap %>% tidy() # A tibble: 2 × 5 term estimate std.error statistic p.value <chr> <dbl> <dbl> <dbl> <dbl> 1 (Intercept) 5.59 9.96 0.562 0.576 2 flipper_length_mm 0.221 0.0508 4.34 0.0000492
Many linear regression models with a for loop
The naive approach described above is not scalable. A better option is to use loop through each species with a for loop and build linear regression model.
To do that let us first get the list of species that we would like to loop through.
p_species = penguins %>% count(species) %>% pull(species) p_species [1] Adelie Chinstrap Gentoo Levels: Adelie Chinstrap Gentoo
And then we can write a for loop to go over each element in the species vector to build linear regression model.
for (i in seq_along(p_species)){ # get species name s <- p_species[i] # filter data df <- penguins %>% filter(species==s) # fit lm lm_fit <- lm(bill_length_mm ~flipper_length_mm, data=df) lm_fit_tidy <- lm_fit %>% tidy() # print results print(lm_fit_tidy) } # A tibble: 2 × 5 term estimate std.error statistic p.value <chr> <dbl> <dbl> <dbl> <dbl> 1 (Intercept) 13.6 6.00 2.27 0.0249 2 flipper_length_mm 0.133 0.0315 4.21 0.0000447 # A tibble: 2 × 5 term estimate std.error statistic p.value <chr> <dbl> <dbl> <dbl> <dbl> 1 (Intercept) 5.59 9.96 0.562 0.576 2 flipper_length_mm 0.221 0.0508 4.34 0.0000492 # A tibble: 2 × 5 term estimate std.error statistic p.value <chr> <dbl> <dbl> <dbl> <dbl> 1 (Intercept) -20.7 7.04 -2.94 3.88e- 3 2 flipper_length_mm 0.314 0.0324 9.69 8.60e-17
Although building many linear models using a for loop is scalable, it is often difficult to use the results from each model. In this example we have just printed the results for convenience. Often one might want to save the result and access different aspect of the results later.
Many linear models with purrr in tidyverse
Tidyverse’ purrr package can help building many linear models easily and access the results from linear models. In the example below let us walk through step by step on building many models.
The basic idea is to split the data as we like using tidyverse framework using list columns in a dataframe and building many models
The basic idea is that we start with the full dataframe and perform many linear models on a subset of the dataframe. We do that by using “the nested data frame”. We use group_by() to get grouped dataframe and use nest() function to create a nested dataframe.
penguins_by_species <- penguins %>% group_by(species) %>% nest() penguins_by_species # A tibble: 3 × 2 # Groups: species [3] species data <fct> <list> 1 Adelie <tibble [152 × 7]> 2 Gentoo <tibble [124 × 7]> 3 Chinstrap <tibble [68 × 7]>
Our nested dataframe looks like this where one column is our penguin species and the other column has a subset of the dataframe for each grouping variable. Basically, we have created a data frame with a column that is a list of smaller data frames.
penguins_by_species # A tibble: 3 × 2 # Groups: species [3] species data <fct> <list> 1 Adelie <tibble [152 × 7]> 2 Gentoo <tibble [124 × 7]> 3 Chinstrap <tibble [68 × 7]>
We can check single element of the data column as shown below. Here we have a data frame containing Adelei penguins.
penguins_by_species$data[[1]] # A tibble: 152 × 7 island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year <fct> <dbl> <dbl> <int> <int> <fct> <int> 1 Torge… 39.1 18.7 181 3750 male 2007 2 Torge… 39.5 17.4 186 3800 fema… 2007 3 Torge… 40.3 18 195 3250 fema… 2007 4 Torge… NA NA NA NA <NA> 2007 5 Torge… 36.7 19.3 193 3450 fema… 2007 6 Torge… 39.3 20.6 190 3650 male 2007 7 Torge… 38.9 17.8 181 3625 fema… 2007 8 Torge… 39.2 19.6 195 4675 male 2007 9 Torge… 34.1 18.1 193 3475 <NA> 2007 10 Torge… 42 20.2 190 4250 <NA> 2007 # ℹ 142 more rows
Now we are almost ready to fit linear models. Let us first write a helper function which takes a dataframe as input and fits linear regression model on two of the variables in the dataframe.
linear_reg <- function(df){ lm(bill_length_mm ~flipper_length_mm, data=df) }
Then we can use purrr’s map() function to fit many linear regression models by using the nested dataframe and the function to fit linear model.
And we store the result of linear for model for each penguin species as list column variable to the nested dataframe.
m_by_species <- penguins_by_species %>% mutate(lm_fit = map(data, linear_reg))
Just as we could get a peek at the dataframe before, we can access the elements of list columns using similar approach with the column name and double square brackets to get the linear model results.
lm_by_species$lm_fit[[1]] Call: lm(formula = bill_length_mm ~ flipper_length_mm, data = df) Coefficients: (Intercept) flipper_length_mm 13.5871 0.1327
We can go ahead and add further columns, like tidied results using broom’s tidy() function
penguins_by_species %>% mutate(lm_fit = map(data, linear_reg)) %>% mutate(tidy = map(lm_fit, broom::tidy)) %>% unnest(tidy) # A tibble: 6 × 8 # Groups: species [3] species data lm_fit term estimate std.error statistic p.value <fct> <list> <list> <chr> <dbl> <dbl> <dbl> <dbl> 1 Adelie <tibble> <lm> (Int… 13.6 6.00 2.27 2.49e- 2 2 Adelie <tibble> <lm> flip… 0.133 0.0315 4.21 4.47e- 5 3 Gentoo <tibble> <lm> (Int… -20.7 7.04 -2.94 3.88e- 3 4 Gentoo <tibble> <lm> flip… 0.314 0.0324 9.69 8.60e-17 5 Chinstrap <tibble [68 × 7]> <lm> (Int… 5.59 9.96 0.562 5.76e- 1 6 Chinstrap <tibble [68 × 7]> <lm> flip… 0.221 0.0508 4.34 4.92e- 5
And we wanted a specific variable, we can use filter to get the rows we needed.
penguins_by_species %>% mutate(lm_fit = map(data, linear_reg), tidy = map(lm_fit, broom::tidy)) %>% unnest(tidy) %>% filter(term=="flipper_length_mm") # A tibble: 3 × 8 # Groups: species [3] species data lm_fit term estimate std.error statistic p.value <fct> <list> <list> <chr> <dbl> <dbl> <dbl> <dbl> 1 Adelie <tibble> <lm> flip… 0.133 0.0315 4.21 4.47e- 5 2 Gentoo <tibble> <lm> flip… 0.314 0.0324 9.69 8.60e-17 3 Chinstrap <tibble [68 × 7]> <lm> flip… 0.221 0.0508 4.34 4.92e- 5
Here is a quick plot looking at the effect sizes and the std error of the many linear models in a series of steps with purrr.
penguins_by_species %>% mutate(lm_fit = map(data, linear_reg), tidy = map(lm_fit, broom::tidy)) %>% unnest(tidy) %>% filter(term=="flipper_length_mm") %>% ggplot(aes(species, estimate, ymin = estimate - std.error, ymax = estimate + std.error)) + geom_pointrange()