In this tutorial, we will learn how to use dplyr’s across() function to compute the means of all numeric columns in a dataframe. R offers many ways to compute column means; here we take the tidyverse approach and use dplyr’s across() function to compute column-wise means.
We will see two examples. First, we will compute column-wise mean values when there are no missing values in the dataframe (after dropping incomplete rows). Then, in the second example, we will learn how to compute mean values when there are missing values in the dataframe.
The basic approach is to first select all numerical columns in a dataframe and then compute a summary statistic such as the mean.
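library(tidyverse)
# Assumption: the penguins data shown below comes from the palmerpenguins package
library(palmerpenguins)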
penguins %>% head()
## # A tibble: 6 × 8
##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Adelie  Torge…           39.1          18.7              181        3750 male
## 2 Adelie  Torge…           39.5          17.4              186        3800 fema…
## 3 Adelie  Torge…           40.3          18                195        3250 fema…
## 4 Adelie  Torge…           NA            NA                 NA          NA <NA>
## 5 Adelie  Torge…           36.7          19.3              193        3450 fema…
## 6 Adelie  Torge…           39.3          20.6              190        3650 male
## # … with 1 more variable: year <int>
How to compute column means using dplyr’s across() function
Here we use the summarise() function in combination with dplyr’s across() function to compute column means. Within across(), we select the numerical columns using where(is.numeric), as shown below.
Note that we compute the column means after removing every row that contains an NA, using drop_na().
penguins %>%
  drop_na() %>%
  summarise(across(where(is.numeric), mean))
## # A tibble: 1 × 5
##   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
##            <dbl>         <dbl>             <dbl>       <dbl> <dbl>
## 1           44.0          17.2              201.       4207. 2008.
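To see the same pattern without the penguins data, here is a minimal sketch on a small made-up tibble. The name toy and its columns are hypothetical and only serve as illustration. The character column is skipped by where(is.numeric), so only the two numeric columns are averaged.

```r
library(dplyr)

# Hypothetical toy data: one character column and two numeric columns
toy <- tibble(
  group  = c("a", "b", "c", "d"),
  height = c(1, 2, 3, 4),
  weight = c(10, 20, 30, 40)
)

# where(is.numeric) keeps height and weight and drops group;
# mean is then applied to each selected column
toy_means <- toy %>%
  summarise(across(where(is.numeric), mean))

toy_means
```

The result is a tibble with a single row and one column per numeric variable, just like the penguins output above.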
Before dplyr version 1.0.0, we did not have to wrap a predicate function such as is.numeric in where() when selecting columns. If we leave out where() now, the code still runs, but we get the following warning.
penguins %>%
  drop_na() %>%
  summarise(across(is.numeric, mean))
## Warning: Predicate functions must be wrapped in `where()`.
##
##   # Bad
##   data %>% select(is.numeric)
##
##   # Good
##   data %>% select(where(is.numeric))
##
## ℹ Please update your code.
## This message is displayed once per session.
dplyr’s across(): column means with missing values.
In the previous example, we used drop_na() to remove every row that has at least one missing value. Sometimes we may not want to throw away all of those rows, because a row with an NA in one column can still hold valid values in the others. To compute column means while keeping such rows, we can pass na.rm=TRUE to mean(), as shown below.
penguins %>%
  summarise(across(where(is.numeric), mean, na.rm=TRUE))
## # A tibble: 1 × 5
##   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
##            <dbl>         <dbl>             <dbl>       <dbl> <dbl>
## 1           43.9          17.2              201.       4202. 2008.
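Note that the mean values are slightly different from the first example. drop_na() removes a row if any of its columns is missing, including non-numeric columns such as sex, whereas na.rm=TRUE only skips the missing values within each column, so the two approaches average over different sets of rows.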
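The difference between the two approaches is easier to see on a tiny example. The sketch below uses a made-up tibble (toy_na and its columns are hypothetical) in which each column has its NA in a different row. It also passes na.rm through an anonymous function instead of as an extra argument to across(), because dplyr 1.1.0 and later deprecate passing extra arguments through across().

```r
library(dplyr)
library(tidyr)  # provides drop_na()

# Hypothetical toy data: x is missing in row 3, y is missing in row 2
toy_na <- tibble(
  x = c(1, 2, NA, 4),
  y = c(10, NA, 30, 40)
)

# Approach 1: drop every row with an NA in any column, then average.
# Only rows 1 and 4 survive, so both means use the same two rows.
means_complete_rows <- toy_na %>%
  drop_na() %>%
  summarise(across(where(is.numeric), mean))

# Approach 2: keep all rows and skip NAs column by column.
# Each mean uses the three non-missing values of its own column.
means_per_column <- toy_na %>%
  summarise(across(where(is.numeric), function(col) mean(col, na.rm = TRUE)))

means_complete_rows
means_per_column
```

With drop_na(), each mean is computed from the two complete rows only. With na.rm = TRUE, each column keeps its three observed values, so the means differ.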