Many linear regression models with tidyverse

In this tutorial, we will learn how to build many simple linear regression models in R using tidyverse. We will start with a simple approach that is not scalable to illustrate the challenges. Then we will show a naive approach using for loop to build many linear regression models. And finally we will show an elegant solution of building many linear regression models using map() function in purrr in tidyverse.

Let us load the packages needed . We will use penguins data to build many linear regression models with lm().

library(tidyverse)
library(palmerpenguins)
theme_set(theme_bw(16))

penguins %>% head()

# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>

Let us say we are interested in building linear regression models between bill length and flipper length for all three penguin species in the data. Here we just visualize the linear association of these two quantitative variables.

penguins %>% 
  ggplot(aes(bill_length_mm, flipper_length_mm))+
  geom_point()+
  facet_wrap(~species)

many linear regression model data as scatter plot

Naive approach to build many linear models

In our example we need to build three simple linear regression models. A naive approach is to build 3 linear regression models one by one manually. To do that, we might filter the orginal data to a subset we want to build one linear regression model as shown below. Here we get data for one species

Adelie_df <- penguins %>% 
  filter(species=="Adelie")

And build linear regression model for one species using lm() function in R.

lm_Adelie = lm(bill_length_mm ~flipper_length_mm, data=Adelie_df)

We can then get the results from the linear regression model. Here we find that bill length is strongly associated with flipper length in Adelie penguins.

lm_Adelie %>% tidy()

# A tibble: 2 × 5
  term              estimate std.error statistic   p.value
  <chr>                <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)         13.6      6.00        2.27 0.0249   
2 flipper_length_mm    0.133    0.0315      4.21 0.0000447

With the naive approach, we will do the same but now for the second species.

Chinstrap_df <- penguins %>% 
  filter(species=="Chinstrap") 

lm_Chinstrap = lm(bill_length_mm ~flipper_length_mm, data=Chinstrap_df)
lm_Chinstrap %>% tidy()

# A tibble: 2 × 5
  term              estimate std.error statistic   p.value
  <chr>                <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)          5.59     9.96       0.562 0.576    
2 flipper_length_mm    0.221    0.0508     4.34  0.0000492

Many linear regression models with a for loop

The naive approach described above is not scalable. A better option is to use loop through each species with a for loop and build linear regression model.

To do that let us first get the list of species that we would like to loop through.

p_species  = penguins %>%
  count(species) %>%
  pull(species)
p_species

[1] Adelie    Chinstrap Gentoo   
Levels: Adelie Chinstrap Gentoo

And then we can write a for loop to go over each element in the species vector to build linear regression model.

for (i in seq_along(p_species)){
  # get species name
  s <- p_species[i]
  # filter data
  df <- penguins %>% 
    filter(species==s)
  # fit lm 
  lm_fit <- lm(bill_length_mm ~flipper_length_mm, data=df)
  lm_fit_tidy <- lm_fit %>% tidy()
  # print results
  print(lm_fit_tidy)
}

# A tibble: 2 × 5
  term              estimate std.error statistic   p.value
  <chr>                <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)         13.6      6.00        2.27 0.0249   
2 flipper_length_mm    0.133    0.0315      4.21 0.0000447
# A tibble: 2 × 5
  term              estimate std.error statistic   p.value
  <chr>                <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)          5.59     9.96       0.562 0.576    
2 flipper_length_mm    0.221    0.0508     4.34  0.0000492
# A tibble: 2 × 5
  term              estimate std.error statistic  p.value
  <chr>                <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)        -20.7      7.04       -2.94 3.88e- 3
2 flipper_length_mm    0.314    0.0324      9.69 8.60e-17

Although building many linear models using a for loop is scalable, it is often difficult to use the results from each model. In this example we have just printed the results for convenience. Often one might want to save the result and access different aspect of the results later.

Many linear models with purrr in tidyverse

Tidyverse’ purrr package can help building many linear models easily and access the results from linear models. In the example below let us walk through step by step on building many models.

The basic idea is to split the data as we like using tidyverse framework using list columns in a dataframe and building many models

The basic idea is that we start with the full dataframe and perform many linear models on a subset of the dataframe. We do that by using “the nested data frame”. We use group_by() to get grouped dataframe and use nest() function to create a nested dataframe.

penguins_by_species <- penguins %>%
  group_by(species) %>%
  nest()
penguins_by_species

# A tibble: 3 × 2
# Groups:   species [3]
  species   data              
  <fct>     <list>            
1 Adelie    <tibble [152 × 7]>
2 Gentoo    <tibble [124 × 7]>
3 Chinstrap <tibble [68 × 7]>

Our nested dataframe looks like this where one column is our penguin species and the other column has a subset of the dataframe for each grouping variable. Basically, we have created a data frame with a column that is a list of smaller data frames.

penguins_by_species

# A tibble: 3 × 2
# Groups:   species [3]
  species   data              
  <fct>     <list>            
1 Adelie    <tibble [152 × 7]>
2 Gentoo    <tibble [124 × 7]>
3 Chinstrap <tibble [68 × 7]>

We can check single element of the data column as shown below. Here we have a data frame containing Adelei penguins.

penguins_by_species$data[[1]]

# A tibble: 152 × 7
   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex    year
   <fct>           <dbl>         <dbl>             <int>       <int> <fct> <int>
 1 Torge…           39.1          18.7               181        3750 male   2007
 2 Torge…           39.5          17.4               186        3800 fema…  2007
 3 Torge…           40.3          18                 195        3250 fema…  2007
 4 Torge…           NA            NA                  NA          NA <NA>   2007
 5 Torge…           36.7          19.3               193        3450 fema…  2007
 6 Torge…           39.3          20.6               190        3650 male   2007
 7 Torge…           38.9          17.8               181        3625 fema…  2007
 8 Torge…           39.2          19.6               195        4675 male   2007
 9 Torge…           34.1          18.1               193        3475 <NA>   2007
10 Torge…           42            20.2               190        4250 <NA>   2007
# ℹ 142 more rows

Now we are almost ready to fit linear models. Let us first write a helper function which takes a dataframe as input and fits linear regression model on two of the variables in the dataframe.

linear_reg <- function(df){
  lm(bill_length_mm ~flipper_length_mm, data=df)
}

Then we can use purrr’s map() function to fit many linear regression models by using the nested dataframe and the function to fit linear model.

And we store the result of linear for model for each penguin species as list column variable to the nested dataframe.

m_by_species <- penguins_by_species %>%
  mutate(lm_fit = map(data, linear_reg))

Just as we could get a peek at the dataframe before, we can access the elements of list columns using similar approach with the column name and double square brackets to get the linear model results.

lm_by_species$lm_fit[[1]]


Call:
lm(formula = bill_length_mm ~ flipper_length_mm, data = df)

Coefficients:
      (Intercept)  flipper_length_mm  
          13.5871             0.1327

We can go ahead and add further columns, like tidied results using broom’s tidy() function

penguins_by_species %>%
  mutate(lm_fit = map(data, linear_reg)) %>%
  mutate(tidy = map(lm_fit, broom::tidy)) %>%
  unnest(tidy)

# A tibble: 6 × 8
# Groups:   species [3]
  species   data              lm_fit term  estimate std.error statistic  p.value
  <fct>     <list>            <list> <chr>    <dbl>     <dbl>     <dbl>    <dbl>
1 Adelie    <tibble>          <lm>   (Int…   13.6      6.00       2.27  2.49e- 2
2 Adelie    <tibble>          <lm>   flip…    0.133    0.0315     4.21  4.47e- 5
3 Gentoo    <tibble>          <lm>   (Int…  -20.7      7.04      -2.94  3.88e- 3
4 Gentoo    <tibble>          <lm>   flip…    0.314    0.0324     9.69  8.60e-17
5 Chinstrap <tibble [68 × 7]> <lm>   (Int…    5.59     9.96       0.562 5.76e- 1
6 Chinstrap <tibble [68 × 7]> <lm>   flip…    0.221    0.0508     4.34  4.92e- 5

And we wanted a specific variable, we can use filter to get the rows we needed.

penguins_by_species %>%
  mutate(lm_fit = map(data, linear_reg),
         tidy = map(lm_fit, broom::tidy)) %>%
  unnest(tidy) %>%
  filter(term=="flipper_length_mm")

# A tibble: 3 × 8
# Groups:   species [3]
  species   data              lm_fit term  estimate std.error statistic  p.value
  <fct>     <list>            <list> <chr>    <dbl>     <dbl>     <dbl>    <dbl>
1 Adelie    <tibble>          <lm>   flip…    0.133    0.0315      4.21 4.47e- 5
2 Gentoo    <tibble>          <lm>   flip…    0.314    0.0324      9.69 8.60e-17
3 Chinstrap <tibble [68 × 7]> <lm>   flip…    0.221    0.0508      4.34 4.92e- 5

Here is a quick plot looking at the effect sizes and the std error of the many linear models in a series of steps with purrr.

penguins_by_species %>%
  mutate(lm_fit = map(data, linear_reg),
         tidy = map(lm_fit, broom::tidy)) %>%
  unnest(tidy) %>%
  filter(term=="flipper_length_mm") %>%
  ggplot(aes(species, estimate,
                  ymin = estimate - std.error,
                  ymax = estimate + std.error)) +
  geom_pointrange()