dplyr across(): Compute column-wise mean

In this tutorial, we will learn how to use dplyr’s across() function to compute the mean of every numeric column in a dataframe. R offers many ways to compute column means. Here we will take the tidyverse approach and use dplyr’s across() function.

We will see two examples. In the first, we will compute column-wise means when there are no missing values in the dataframe. In the second, we will learn how to compute the means when the dataframe does contain missing values.

The basic approach is to select all numeric columns of the dataframe and then compute a summary statistic, such as the mean, for each of them.

library(tidyverse)
library(palmerpenguins)  # provides the penguins dataset used below

penguins %>% head()

## # A tibble: 6 × 8
##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Adelie  Torge…           39.1          18.7              181        3750 male 
## 2 Adelie  Torge…           39.5          17.4              186        3800 fema…
## 3 Adelie  Torge…           40.3          18                195        3250 fema…
## 4 Adelie  Torge…           NA            NA                 NA          NA <NA> 
## 5 Adelie  Torge…           36.7          19.3              193        3450 fema…
## 6 Adelie  Torge…           39.3          20.6              190        3650 male 
## # … with 1 more variable: year <int>

How to compute Column means using dplyr’s across() function

Here we use the summarise() function in combination with dplyr’s across() function to compute the column means. Within across(), we select the numeric columns using where(is.numeric), as shown below.

Note we compute column means after removing any row with NAs.

penguins %>% 
  drop_na() %>%
  summarise(across(where(is.numeric), mean))

## # A tibble: 1 × 5
##   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
##            <dbl>         <dbl>             <dbl>       <dbl> <dbl>
## 1           44.0          17.2              201.       4207. 2008.
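To make the pattern easier to check by hand, here is a minimal sketch on a small made-up tibble (the `toy` data and its column names are hypothetical, not part of the penguins dataset). It applies the same summarise() + across() + where(is.numeric) combination and compares the result with base R's colMeans().

```r
library(dplyr)

# Hypothetical toy data for illustration only
toy <- tibble(
  group  = c("a", "b", "c"),
  height = c(1, 2, 3),
  weight = c(10, 20, 60)
)

# across() with where(is.numeric) averages every numeric column
# and skips the character column `group`
col_means <- toy %>%
  summarise(across(where(is.numeric), mean))

# Base R equivalent for comparison: colMeans() on the numeric columns
base_means <- colMeans(toy[c("height", "weight")])

print(col_means)
print(base_means)
```

Both approaches should agree. The across() version has the advantage that we do not need to list the numeric columns by name.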

Older code sometimes passes a predicate function such as is.numeric directly when selecting columns, without the where() wrapper. This form is deprecated, so if we don’t use where() to select columns, we get the following warning.

penguins %>% 
  drop_na() %>%
  summarise(across(is.numeric, mean))

## Warning: Predicate functions must be wrapped in `where()`.
## 
##   # Bad
##   data %>% select(is.numeric)
## 
##   # Good
##   data %>% select(where(is.numeric))
## 
## ℹ Please update your code.
## This message is displayed once per session.

dplyr’s across(): column means with missing values.

In the previous example, we used drop_na() to remove every row that contains at least one missing value. Sometimes we may not want to discard whole rows just because one of their columns is missing. Instead, we can keep all rows and ignore the NAs within each column while computing its mean. To do that, we can pass na.rm=TRUE to mean() as shown below.

penguins %>% 
  summarise(across(where(is.numeric), 
            mean, 
            na.rm=TRUE))

## # A tibble: 1 × 5
##   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
##            <dbl>         <dbl>             <dbl>       <dbl> <dbl>
## 1           43.9          17.2              201.       4202. 2008.

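Note that the mean values are slightly different from the first example. drop_na() removes a row if any of its columns is NA, including non-numeric columns such as sex, whereas na.rm=TRUE drops only the missing values within each column. Since not all columns have NAs in the same rows, the two approaches average over different sets of values.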
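The difference between the two approaches is easy to see on a small made-up tibble (the `toy` data below is hypothetical). This sketch also writes the na.rm call as a lambda, `~ mean(.x, na.rm = TRUE)`, because to our knowledge dplyr 1.1.0 and later deprecate passing extra arguments such as na.rm through across() directly. Treat that version detail as an assumption and check the documentation of your installed dplyr.

```r
library(dplyr)
library(tidyr)  # for drop_na()

# Hypothetical toy data: NAs sit in different rows of different columns
toy <- tibble(
  label = c("a", NA, "c", "d"),
  x     = c(1, 2, NA, 5),
  y     = c(10, 20, 30, 60)
)

# Row-wise removal: rows 2 and 3 are dropped, so only rows 1 and 4 remain
means_complete <- toy %>%
  drop_na() %>%
  summarise(across(where(is.numeric), mean))

# Column-wise removal: each column ignores only its own NAs
means_na_rm <- toy %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

print(means_complete)
print(means_na_rm)
```

Here drop_na() averages x over 1 and 5 and y over 10 and 60, while the na.rm version averages x over 1, 2 and 5 and y over all four values, so the two results differ in both columns.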
