How to Compute Z-Score of Multiple Columns

In this post, we will learn how to compute Z-scores for multiple variables (columns) at once with the tidyverse in R, using three different approaches.

First, we will show an example of computing Z-scores of multiple columns where all the columns in the dataframe are numeric. Then we will show examples where the dataframe has both numeric and non-numeric columns.
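
As a quick refresher, a Z-score rescales a value by subtracting the mean and dividing by the standard deviation. Here is a minimal sketch in base R on a made-up vector (the values in x are hypothetical and are not taken from the penguins data):

```r
# Hypothetical toy vector, used only to illustrate the formula
x <- c(2, 4, 6, 8, 10)

# Z-score: subtract the mean, then divide by the sample standard deviation
z <- (x - mean(x)) / sd(x)

mean(z)  # approximately 0
sd(z)    # approximately 1
```

After the transformation, the values have mean 0 and standard deviation 1, which is what every approach in this post produces column by column.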

library(tidyverse)
library(palmerpenguins)

penguins |>
  head()

# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>

Compute Z-score on multiple columns: When all columns are numeric

Let us first remove rows with missing values using drop_na(), drop the year column, and then select all numeric columns in our dataframe using the where() and is.numeric() functions.

df <- penguins |>
  drop_na() |>
  select(-year) |>
  select(where(is.numeric))

Now we have a dataframe where all columns are numeric.

df

# A tibble: 333 × 4
   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
            <dbl>         <dbl>             <int>       <int>
 1           39.1          18.7               181        3750
 2           39.5          17.4               186        3800
 3           40.3          18                 195        3250
 4           36.7          19.3               193        3450
 5           39.3          20.6               190        3650
 6           38.9          17.8               181        3625
 7           39.2          19.6               195        4675
 8           41.1          17.6               182        3200
 9           38.6          21.2               191        3800
10           34.6          21.1               198        4400
# ℹ 323 more rows

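We can use the scale() function to compute Z-scores for all the columns at once. By default, scale() centers each column on its mean and divides it by its standard deviation. It returns a matrix, so we convert the result back to a tibble with as_tibble().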

df |> 
  scale() |>
  as_tibble()

# A tibble: 333 × 4
   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
            <dbl>         <dbl>             <dbl>       <dbl>
 1         -0.895         0.780            -1.42       -0.568
 2         -0.822         0.119            -1.07       -0.506
 3         -0.675         0.424            -0.426      -1.19 
 4         -1.33          1.08             -0.568      -0.940
 5         -0.858         1.74             -0.782      -0.692
 6         -0.931         0.323            -1.42       -0.723
 7         -0.876         1.24             -0.426       0.581
 8         -0.529         0.221            -1.35       -1.25 
 9         -0.986         2.05             -0.711      -0.506
10         -1.72          2.00             -0.212       0.240
# ℹ 323 more rows
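
To see what scale() is doing, here is a small sketch on a made-up tibble (toy and its columns a and b are hypothetical stand-ins for the numeric penguin columns). It checks that the result is a matrix, looks at the centering and scaling values that scale() stores as attributes, and confirms that each scaled column has mean 0 and standard deviation 1.

```r
library(tibble)

# Hypothetical toy data standing in for the numeric penguin columns
toy <- tibble(a = c(1, 2, 3, 4), b = c(10, 20, 40, 50))

scaled <- scale(toy)

# scale() returns a matrix, which is why as_tibble() is needed afterwards
is.matrix(scaled)

# Column means and standard deviations used for centering and scaling
attr(scaled, "scaled:center")
attr(scaled, "scaled:scale")

z <- as_tibble(scaled)
colMeans(z)    # approximately 0 for each column
sapply(z, sd)  # approximately 1 for each column
```

The stored attributes can be handy if you later want to convert Z-scores back to the original units.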

Compute Z-score on multiple columns: When some columns are non-numeric

The second approach computes Z-scores for all numeric columns of a dataframe in which some columns are numeric and others are not. We will use the across() function inside mutate() to select the numeric columns and compute the Z-scores manually, as shown below.

penguins |>
  drop_na() |>
  select(-year) |>
  mutate(across(where(is.numeric), ~ (.x - mean(.x)) / sd(.x)))

# A tibble: 333 × 7
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <dbl>       <dbl>
 1 Adelie  Torgersen         -0.895         0.780            -1.42       -0.568
 2 Adelie  Torgersen         -0.822         0.119            -1.07       -0.506
 3 Adelie  Torgersen         -0.675         0.424            -0.426      -1.19 
 4 Adelie  Torgersen         -1.33          1.08             -0.568      -0.940
 5 Adelie  Torgersen         -0.858         1.74             -0.782      -0.692
 6 Adelie  Torgersen         -0.931         0.323            -1.42       -0.723
 7 Adelie  Torgersen         -0.876         1.24             -0.426       0.581
 8 Adelie  Torgersen         -0.529         0.221            -1.35       -1.25 
 9 Adelie  Torgersen         -0.986         2.05             -0.711      -0.506
10 Adelie  Torgersen         -1.72          2.00             -0.212       0.240
# ℹ 323 more rows
# ℹ 1 more variable: sex <fct>
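
The same pattern can be tried on a small made-up dataframe to check two things: that non-numeric columns pass through unchanged, and that the manual formula agrees with scale(). This is only a sketch; toy, group, x, and y are hypothetical names, and the \(col) anonymous-function syntax assumes R 4.1 or later (the same requirement as the |> pipe).

```r
library(dplyr)

# Hypothetical toy data with one factor column and two numeric columns
toy <- tibble(
  group = factor(c("a", "a", "b", "b")),
  x = c(1, 2, 3, 4),
  y = c(10L, 20L, 40L, 50L)
)

# Manual Z-score on every numeric column; other columns are not touched
z_manual <- toy |>
  mutate(across(where(is.numeric), \(col) (col - mean(col)) / sd(col)))

# The factor column is unchanged
identical(z_manual$group, toy$group)

# The manual result matches scale() on the same column
all.equal(z_manual$x, as.numeric(scale(toy$x)))
```

Note that the integer column y becomes a double after the transformation, just as flipper_length_mm and body_mass_g change from <int> to <dbl> in the output above.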

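The third approach is similar to the one above, but this time we will use the scale() function inside across() instead of computing the Z-scores manually. Because scale() returns a one-column matrix when given a single vector, we use [, 1] to extract the result as a plain vector.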

penguins |>
  drop_na() |>
  select(-year) |>
  mutate(across(where(is.numeric), ~ scale(.)[, 1]))

# A tibble: 333 × 7
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <dbl>       <dbl>
 1 Adelie  Torgersen         -0.895         0.780            -1.42       -0.568
 2 Adelie  Torgersen         -0.822         0.119            -1.07       -0.506
 3 Adelie  Torgersen         -0.675         0.424            -0.426      -1.19 
 4 Adelie  Torgersen         -1.33          1.08             -0.568      -0.940
 5 Adelie  Torgersen         -0.858         1.74             -0.782      -0.692
 6 Adelie  Torgersen         -0.931         0.323            -1.42       -0.723
 7 Adelie  Torgersen         -0.876         1.24             -0.426       0.581
 8 Adelie  Torgersen         -0.529         0.221            -1.35       -1.25 
 9 Adelie  Torgersen         -0.986         2.05             -0.711      -0.506
10 Adelie  Torgersen         -1.72          2.00             -0.212       0.240
# ℹ 323 more rows
# ℹ 1 more variable: sex <fct>
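
The [, 1] in the code above is easy to overlook, so here is a minimal base R sketch of what it does, using a made-up vector (x is hypothetical). Given a plain vector, scale() returns a one-column matrix, and indexing the first column turns it back into an ordinary numeric vector.

```r
# Hypothetical toy vector
x <- c(1, 2, 3, 4)

# scale() on a vector returns a matrix with one column
m <- scale(x)
dim(m)  # 4 1

# Taking the first column drops the matrix structure
v <- m[, 1]
is.null(dim(v))  # TRUE

# as.numeric() is an alternative way to get the same plain vector
all.equal(v, as.numeric(m))
```

Without this step, mutate() would store each result as a matrix column rather than a regular numeric column.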
