In this post, we will learn how to compute Z-scores for multiple variables (columns) at the same time in R with the tidyverse, using three approaches.
First, we will compute Z-scores for a dataframe in which all the columns are numeric. Then we will show examples where the dataframe has both numeric and non-numeric columns.
library(tidyverse)
library(palmerpenguins)
penguins |>
  head()
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>
Compute Z-score on multiple columns: When all columns are numeric
Let us first drop the rows with missing values, remove the year column, and keep only the numeric columns in our dataframe using select() with where() and is.numeric().
df <- penguins |>
  drop_na() |>
  select(-year) |>
  select(where(is.numeric))
Now we have a dataframe where all columns are numeric.
df
# A tibble: 333 × 4
   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
            <dbl>         <dbl>             <int>       <int>
 1           39.1          18.7               181        3750
 2           39.5          17.4               186        3800
 3           40.3          18                 195        3250
 4           36.7          19.3               193        3450
 5           39.3          20.6               190        3650
 6           38.9          17.8               181        3625
 7           39.2          19.6               195        4675
 8           41.1          17.6               182        3200
 9           38.6          21.2               191        3800
10           34.6          21.1               198        4400
# ℹ 323 more rows
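We can use the scale() function to compute Z-scores for all the columns at once. scale() returns a matrix, so we convert the result back to a tibble with as_tibble().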
df |>
  scale() |>
  as_tibble()
# A tibble: 333 × 4
   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
            <dbl>         <dbl>             <dbl>       <dbl>
 1         -0.895         0.780            -1.42       -0.568
 2         -0.822         0.119            -1.07       -0.506
 3         -0.675         0.424            -0.426      -1.19
 4         -1.33          1.08             -0.568      -0.940
 5         -0.858         1.74             -0.782      -0.692
 6         -0.931         0.323            -1.42       -0.723
 7         -0.876         1.24             -0.426       0.581
 8         -0.529         0.221            -1.35       -1.25
 9         -0.986         2.05             -0.711      -0.506
10         -1.72          2.00             -0.212       0.240
# ℹ 323 more rows
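A quick way to check the result is to confirm that every standardized column has a mean of (approximately) zero and a standard deviation of one. The snippet below is a minimal sketch of that check; the names z and z_summary are introduced here only for illustration.

```r
library(tidyverse)
library(palmerpenguins)

# Rebuild the all-numeric dataframe used above
df <- penguins |>
  drop_na() |>
  select(-year) |>
  select(where(is.numeric))

# Standardize every column and convert the matrix back to a tibble
z <- df |>
  scale() |>
  as_tibble()

# Summarise each standardized column: the mean should be ~0 and the sd should be 1
# (means are only zero up to floating-point error)
z_summary <- z |>
  summarise(across(everything(), list(mean = mean, sd = sd)))

z_summary
```

The summary has one row with a mean and an sd entry per column, which makes it easy to spot a column that was not standardized as expected.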
Compute Z-score on multiple columns: When some columns are non-numeric
The second approach computes Z-scores for all the numeric columns of a dataframe that also contains non-numeric columns. We will use across() inside mutate() to pick the numeric columns with where(is.numeric) and compute the Z-scores manually, by subtracting each column's mean and dividing by its standard deviation, as shown below. The non-numeric columns are left unchanged.
penguins |>
  drop_na() |>
  select(-year) |>
  mutate(across(where(is.numeric), ~ (.x - mean(.x)) / sd(.x)))
# A tibble: 333 × 7
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <dbl>       <dbl>
 1 Adelie  Torgersen         -0.895         0.780            -1.42       -0.568
 2 Adelie  Torgersen         -0.822         0.119            -1.07       -0.506
 3 Adelie  Torgersen         -0.675         0.424            -0.426      -1.19
 4 Adelie  Torgersen         -1.33          1.08             -0.568      -0.940
 5 Adelie  Torgersen         -0.858         1.74             -0.782      -0.692
 6 Adelie  Torgersen         -0.931         0.323            -1.42       -0.723
 7 Adelie  Torgersen         -0.876         1.24             -0.426       0.581
 8 Adelie  Torgersen         -0.529         0.221            -1.35       -1.25
 9 Adelie  Torgersen         -0.986         2.05             -0.711      -0.506
10 Adelie  Torgersen         -1.72          2.00             -0.212       0.240
# ℹ 323 more rows
# ℹ 1 more variable: sex <fct>
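The manual formula gives the same numbers as scale() with its default arguments, because scale() also centers on the mean and divides by the sample standard deviation. Here is a minimal sketch of that equivalence on a small made-up vector; x, z_manual and z_scale are hypothetical names used only for this illustration, not part of the penguins data.

```r
# A tiny hypothetical vector, chosen so the arithmetic is easy to follow
x <- c(2, 4, 6, 8)

# Manual Z-score: subtract the mean, then divide by the sample standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; as.numeric() flattens it to a plain vector
z_scale <- as.numeric(scale(x))

# Compare the two results up to floating-point tolerance
all.equal(z_manual, z_scale)
```

Note that sd() uses the n - 1 denominator, so both versions standardize with the sample standard deviation rather than the population one.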
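The third approach is similar to the second, but this time we will use the scale() function inside across() instead of computing the Z-score manually. Because scale() returns a one-column matrix, we take its first column with [, 1] to get a plain numeric vector.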
penguins |>
  drop_na() |>
  select(-year) |>
  mutate(across(where(is.numeric), ~ scale(.x)[, 1]))
# A tibble: 333 × 7
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <dbl>       <dbl>
 1 Adelie  Torgersen         -0.895         0.780            -1.42       -0.568
 2 Adelie  Torgersen         -0.822         0.119            -1.07       -0.506
 3 Adelie  Torgersen         -0.675         0.424            -0.426      -1.19
 4 Adelie  Torgersen         -1.33          1.08             -0.568      -0.940
 5 Adelie  Torgersen         -0.858         1.74             -0.782      -0.692
 6 Adelie  Torgersen         -0.931         0.323            -1.42       -0.723
 7 Adelie  Torgersen         -0.876         1.24             -0.426       0.581
 8 Adelie  Torgersen         -0.529         0.221            -1.35       -1.25
 9 Adelie  Torgersen         -0.986         2.05             -0.711      -0.506
10 Adelie  Torgersen         -1.72          2.00             -0.212       0.240
# ℹ 323 more rows
# ℹ 1 more variable: sex <fct>
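The [, 1] in the call above matters because scale() returns a matrix even when it is given a single vector. The sketch below illustrates the difference on a small hypothetical tibble; toy, as_matrix_col and as_vector_col are made-up names for this example only.

```r
library(tidyverse)

# Hypothetical toy data with one numeric and one non-numeric column
toy <- tibble(a = c(2, 4, 6), b = c("x", "y", "z"))

# Without [, 1], the standardized column is stored as a one-column matrix
as_matrix_col <- toy |>
  mutate(across(where(is.numeric), ~ scale(.x)))

# With [, 1], the standardized column is an ordinary numeric vector
as_vector_col <- toy |>
  mutate(across(where(is.numeric), ~ scale(.x)[, 1]))

is.matrix(as_matrix_col$a)  # TRUE
is.matrix(as_vector_col$a)  # FALSE
as_vector_col$a             # -1 0 1
```

Keeping plain vectors tends to be the safer choice, since functions used later in a pipeline may not expect a matrix column inside a tibble.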