How to compute Z-score

In this post, we will learn how to compute Z-scores in R using two different approaches: first manually with the Z-score formula, and then with the scale() function available in base R.

What is Z-score

Z-score is a commonly used transformation to standardize (or, loosely, normalize) a numerical variable. Transforming numerical variables into Z-scores puts them on a common, unitless scale, which makes it easy to compare variables measured in different units.

At its core, a Z-score is a statistical measure that describes how far a data point is from the mean of a variable, expressed in units of standard deviation. It is a useful way to understand how unusual or typical a particular data point is within a distribution.

How to calculate Z-score

To compute the Z-score of a numerical variable, we need the mean and the standard deviation of the variable. We then compute the Z-score for each value by subtracting the mean from the value and dividing the result by the standard deviation.
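
Written as a formula, the description above looks as follows. Here $\bar{x}$ is the sample mean and $s$ is the sample standard deviation; the $n - 1$ denominator is an assumption that matches what R's sd() computes.

```latex
z_i = \frac{x_i - \bar{x}}{s}, \qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}
```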

The magnitude of the Z-score shows how many standard deviations away from the mean the data point is. A Z-score of 0 means the data point is exactly at the mean. A positive Z-score indicates the data point is above the mean. A negative Z-score indicates the data point is below the mean.
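
To make the sign and magnitude concrete, here is a minimal sketch on a toy vector. The numbers are made up for illustration and are not taken from the penguins data.

```r
# Toy vector (made up for illustration, not from the penguins data)
x <- c(2, 4, 6)

# mean(x) is 4 and sd(x) is 2, so the Z-scores are easy to check by hand
z <- (x - mean(x)) / sd(x)
z
# [1] -1  0  1
```

The value 2 lies one standard deviation below the mean, 4 sits exactly at the mean, and 6 lies one standard deviation above it.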

Let us load the packages needed.

library(tidyverse)
library(palmerpenguins)
theme_set(theme_bw(16))

We will use two variables from the Palmer penguins dataset to show how to compute Z-scores.

# Drop rows with missing values, then keep the two numeric columns of interest
df <- penguins |>
  drop_na() |>
  select(bill_depth_mm, body_mass_g)

df |> head()

# A tibble: 6 × 2
  bill_depth_mm body_mass_g
          <dbl>       <int>
1          18.7        3750
2          17.4        3800
3          18          3250
4          19.3        3450
5          20.6        3650
6          17.8        3625

Let us use the summary() function to get a quick summary of the two numerical variables.

df |> summary()

 bill_depth_mm    body_mass_g  
 Min.   :13.10   Min.   :2700  
 1st Qu.:15.60   1st Qu.:3550  
 Median :17.30   Median :4050  
 Mean   :17.16   Mean   :4207  
 3rd Qu.:18.70   3rd Qu.:4775  
 Max.   :21.50   Max.   :6300  

Let us manually compute Z-score for these two variables one-by-one.

df <- df |>
  mutate(
    bill_depth_zscore_m = (bill_depth_mm - mean(bill_depth_mm)) / sd(bill_depth_mm),
    body_mass_zscore_m = (body_mass_g - mean(body_mass_g)) / sd(body_mass_g)
  )

df |> head()

# A tibble: 6 × 4
  bill_depth_mm body_mass_g bill_depth_zscore_m body_mass_zscore_m
          <dbl>       <int>               <dbl>              <dbl>
1          18.7        3750               0.780             -0.568
2          17.4        3800               0.119             -0.506
3          18          3250               0.424             -1.19 
4          19.3        3450               1.08              -0.940
5          20.6        3650               1.74              -0.692
6          17.8        3625               0.323             -0.723
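
If the same calculation is needed for many columns, the formula can be wrapped in a small helper. The sketch below is an illustration only: zscore() is a hypothetical name, not a function from base R or the tidyverse, and the na.rm argument is an assumption for data where missing values have not been dropped beforehand.

```r
# Hypothetical helper (not part of base R or any package):
# standardize a numeric vector, optionally ignoring missing values
zscore <- function(x, na.rm = FALSE) {
  (x - mean(x, na.rm = na.rm)) / sd(x, na.rm = na.rm)
}

# Made-up values with one missing entry
x <- c(10, 20, NA, 30)

zscore(x)                # all NA, because mean() and sd() return NA
zscore(x, na.rm = TRUE)  # the NA stays NA; the other values are standardized
```

In this post the missing values were removed up front with drop_na(), so the plain formula is enough.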

We can look at the summary to see how different the Z-scores are from the original values of the variables.

df |>
  summary()

 bill_depth_mm    body_mass_g   bill_depth_zscore_m body_mass_zscore_m
 Min.   :13.10   Min.   :2700   Min.   :-2.06418    Min.   :-1.8716   
 1st Qu.:15.60   1st Qu.:3550   1st Qu.:-0.79466    1st Qu.:-0.8160   
 Median :17.30   Median :4050   Median : 0.06862    Median :-0.1950   
 Mean   :17.16   Mean   :4207   Mean   : 0.00000    Mean   : 0.0000   
 3rd Qu.:18.70   3rd Qu.:4775   3rd Qu.: 0.77956    3rd Qu.: 0.7053   
 Max.   :21.50   Max.   :6300   Max.   : 2.20143    Max.   : 2.5992 
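
The summary shows a mean of 0 for both Z-score columns. By construction, a Z-scored variable also has a standard deviation of 1, which summary() does not report. A quick sketch of that check, using made-up numbers rather than the penguins data:

```r
# Made-up measurements, for illustration only
x <- c(3.1, 4.7, 5.0, 6.2, 8.4)
z <- (x - mean(x)) / sd(x)

# Mean is 0 and standard deviation is 1, up to floating-point error
all.equal(mean(z), 0)
all.equal(sd(z), 1)
```

The same two checks could be run on bill_depth_zscore_m and body_mass_zscore_m.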

We can also make a scatter plot to see the effect of computing the Z-score on the original variable. The points fall on a straight line, because the Z-score is a linear transformation of the original values. As we expect, the range of the Z-scores is very different from the range of the original values of the variable.

# Plot the original values against the manually computed Z-scores
df |>
  ggplot(aes(x = bill_depth_mm, y = bill_depth_zscore_m)) +
  geom_point() +
  labs(title = "Z-score: before and after")

Figure: Comparing Z-score with its original values
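
Computing a Z-score only shifts the values by the mean and divides them by a positive constant, so the Pearson correlation between a variable and its Z-score is 1 and the ordering of the values is preserved. A small sketch on illustrative values (not the full penguins data):

```r
# Illustrative values only
x <- c(13.1, 15.6, 17.3, 18.7, 21.5)
z <- (x - mean(x)) / sd(x)

# Correlation between the original values and their Z-scores
cor(x, z)

# The ranking of the observations does not change
identical(order(x), order(z))
```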

Computing Z-score using scale() function

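We can also compute Z-scores using the scale() function available in base R. By default, scale() centers the values by their mean and divides by their standard deviation. It returns a one-column matrix, so we wrap the result in c() to get a plain vector for each new column.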

df <-  df |>
  mutate(bill_depth_zscore = c(scale(bill_depth_mm)),
         body_mass_zscore = c(scale(body_mass_g)))
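
The reason for wrapping scale() in c() is easier to see on a toy vector. The sketch below uses made-up numbers and only base R.

```r
x <- c(1, 3, 5)           # toy vector, for illustration only

s <- scale(x)
is.matrix(s)              # TRUE: a 3 x 1 matrix, not a plain vector
attr(s, "scaled:center")  # the mean that was subtracted (3)
attr(s, "scaled:scale")   # the standard deviation used (2)

z <- c(s)                 # c() drops the dimensions and attributes
is.matrix(z)              # FALSE
z
# [1] -1  0  1
```

as.numeric() or as.vector() would work in place of c() here; the post uses c().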

df |> head()

# A tibble: 6 × 6
  bill_depth_mm body_mass_g bill_depth_zscore_m body_mass_zscore_m
          <dbl>       <int>               <dbl>              <dbl>
1          18.7        3750               0.780             -0.568
2          17.4        3800               0.119             -0.506
3          18          3250               0.424             -1.19 
4          19.3        3450               1.08              -0.940
5          20.6        3650               1.74              -0.692
6          17.8        3625               0.323             -0.723
# ℹ 2 more variables: bill_depth_zscore <dbl>, body_mass_zscore <dbl>

We can compare the summaries of the Z-scores computed by the two approaches.

df |>
  select(-bill_depth_mm, -body_mass_g )|>
  summary()

 bill_depth_zscore_m body_mass_zscore_m bill_depth_zscore  body_mass_zscore 
 Min.   :-2.06418    Min.   :-1.8716    Min.   :-2.06418   Min.   :-1.8716  
 1st Qu.:-0.79466    1st Qu.:-0.8160    1st Qu.:-0.79466   1st Qu.:-0.8160  
 Median : 0.06862    Median :-0.1950    Median : 0.06862   Median :-0.1950  
 Mean   : 0.00000    Mean   : 0.0000    Mean   : 0.00000   Mean   : 0.0000  
 3rd Qu.: 0.77956    3rd Qu.: 0.7053    3rd Qu.: 0.77956   3rd Qu.: 0.7053  
 Max.   : 2.20143    Max.   : 2.5992    Max.   : 2.20143   Max.   : 2.5992  

We can check whether each element of the Z-scores computed by the two approaches is the same.

all.equal(df$bill_depth_zscore_m, 
          df$bill_depth_zscore)

TRUE
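
all.equal() compares numeric vectors up to a small tolerance rather than bit for bit, which suits floating-point results. When the inputs differ it returns a description of the difference instead of FALSE, so it is commonly wrapped in isTRUE(). The same check applies to the body mass columns. A sketch using only the six body mass values printed earlier, so the Z-scores here will not match those computed from the full data:

```r
# First six body masses shown in the head() output
x <- c(3750, 3800, 3250, 3450, 3650, 3625)

manual <- (x - mean(x)) / sd(x)  # formula approach
scaled <- c(scale(x))            # scale() approach

# isTRUE() guards against all.equal() returning a message instead of FALSE
isTRUE(all.equal(manual, scaled))
```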
