How to apply a function on multiple columns using across()

In this post, we will learn how to apply one or more functions to multiple columns using dplyr’s across() function. across() is used inside verbs such as summarize() or mutate() to operate on several columns at once. In this example, we will use summarize() together with across() to compute the mean values of multiple columns at the same time.

library(tidyverse)
packageVersion("dplyr")
[1] '1.1.2'

We will use the iris dataset that is built into R.

iris %>% head()

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

A naive way to apply the same transformation to multiple columns is to handle each column separately, as shown below. In this example, we compute the mean values of three columns with three separate mean() calls, one per column.

iris %>%
  summarize(sepal_length_m = mean(Sepal.Length),
            sepal_width_m = mean(Sepal.Width),
            petal_length_m = mean(Petal.Length))

  sepal_length_m sepal_width_m petal_length_m
1       5.843333      3.057333          3.758

Although this works well for a summary function on a few columns, it gets cumbersome when we want to handle many columns or apply several functions.

dplyr’s across() function is designed to make computing across columns easier. With across(), we can apply one or more functions to the columns of interest. For example, we can compute the mean values of the same three columns as in the previous example as follows:

iris %>%
  summarize(across(Sepal.Length:Petal.Length, mean))

  Sepal.Length Sepal.Width Petal.Length
1     5.843333    3.057333        3.758

In the above example, we provided two arguments to the across() function: the first is the set of columns of interest, and the second is the function we want to apply to each of them.
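The columns argument accepts the same selection syntax as select(), so a range such as Sepal.Length:Petal.Length is only one option. The sketch below shows a few alternatives on the same iris data; the result names (means_c, means_sepal, means_numeric) are made up for illustration.

```r
library(dplyr)

# An explicit set of columns, combined with c()
means_c <- iris %>%
  summarize(across(c(Sepal.Length, Petal.Width), mean))

# Columns whose names start with "Sepal"
means_sepal <- iris %>%
  summarize(across(starts_with("Sepal"), mean))

# Every numeric column, selected by type rather than by name
means_numeric <- iris %>%
  summarize(across(where(is.numeric), mean))
```

Selecting with where(is.numeric) picks up all four measurement columns and skips the Species factor, which is handy when the column names are not known in advance.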

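Here is a more explicit way to call across() to compute the means of multiple columns. We name the arguments: .cols specifies the columns to operate on and .fns specifies the function to apply. The function is written as a formula-style lambda, where .x stands for the current column, which lets us pass na.rm=TRUE to mean(). The iris data has no missing values, so the result is the same as before.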

iris %>%
  summarize(across(.cols=Sepal.Length:Petal.Length,
                   .fns = ~ mean(.x, na.rm=TRUE)))

  Sepal.Length Sepal.Width Petal.Length
1     5.843333    3.057333        3.758
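To apply more than one function at a time, .fns can also take a named list of functions. The sketch below computes the mean and standard deviation of the same three columns; the result name mean_sd is made up for illustration, and the .names pattern shown is spelled out only for clarity, since "{.col}_{.fn}" is what across() uses by default when given a list.

```r
library(dplyr)

# A named list of functions: the list names become part of the output names
mean_sd <- iris %>%
  summarize(across(.cols = Sepal.Length:Petal.Length,
                   .fns = list(mean = mean, sd = sd),
                   .names = "{.col}_{.fn}"))

# One output column per (input column, function) pair
names(mean_sd)
```

The output has six columns, named like Sepal.Length_mean and Sepal.Length_sd, with the functions grouped under each input column.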

We can also write custom functions to apply to each column. In the example below, we compute the sum of squared deviations from the mean using the lambda function notation.

iris %>%
  summarize(across(.cols= Sepal.Length:Petal.Length, 
                   .fns = ~sum((.x-mean(.x))^2)))

  Sepal.Length Sepal.Width Petal.Length
1     102.1683    28.30693     464.3254
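The same computation can be written without the formula-style lambda. The sketch below shows two equivalent forms: a named helper function (sum_sq_dev is a hypothetical name chosen here) and R's native anonymous function shorthand \(x), which assumes R 4.1 or later.

```r
library(dplyr)

# Hypothetical helper: sum of squared deviations from the mean
sum_sq_dev <- function(x) sum((x - mean(x))^2)

# Pass the named function directly to across()
ss_named <- iris %>%
  summarize(across(Sepal.Length:Petal.Length, sum_sq_dev))

# Same computation with the native anonymous function syntax (R >= 4.1)
ss_anon <- iris %>%
  summarize(across(Sepal.Length:Petal.Length, \(x) sum((x - mean(x))^2)))
```

A named helper is easier to reuse and test on its own, while the anonymous form keeps a one-off computation next to the call that uses it.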

If we want to compute the function within each group of a categorical variable, we first call group_by() on that variable and then apply across() inside summarize().

Here we compute the mean values of the same three columns for each species.

iris %>%
  group_by(Species) %>%
  summarize(across(Sepal.Length:Petal.Length, mean))

# A tibble: 3 × 4
  Species    Sepal.Length Sepal.Width Petal.Length
  <fct>             <dbl>       <dbl>        <dbl>
1 setosa             5.01        3.43         1.46
2 versicolor         5.94        2.77         4.26
3 virginica          6.59        2.97         5.55
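As mentioned at the start, across() also works inside mutate(), where it transforms the columns in place instead of collapsing them to one row per group. The sketch below is one possible illustration: it centers the three columns on their species means. The result name centered is made up for this example.

```r
library(dplyr)

# Subtract each species' mean from the selected columns;
# the row count stays the same and other columns are left untouched
centered <- iris %>%
  group_by(Species) %>%
  mutate(across(Sepal.Length:Petal.Length, ~ .x - mean(.x))) %>%
  ungroup()

centered %>% head()
```

Because the data is grouped, mean(.x) is computed separately within each species, so the centered columns average to zero within every group.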
