In this post, we will learn how to compute Z-scores in R using two approaches: first manually with the Z-score formula, and then with the scale() function available in base R.
What is a Z-score?
The Z-score is a commonly used transformation to standardize (normalize) a numerical variable. Transforming numerical variables into Z-scores puts them on a common scale, which makes it easy to compare across variables.
At its core, a Z-score is a statistical measure that describes how far a data point is from the mean of a variable, expressed in units of standard deviations. It is a useful way to gauge how unusual or typical a particular data point is within a distribution.
To compute the Z-score of a numerical variable, we need the mean and standard deviation of the variable. We then compute the Z-score for each value by subtracting the mean from the value and dividing the result by the standard deviation.
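Written as a formula, the computation described above looks as follows. The symbols are notation chosen here for illustration: $x_i$ is a single value, $\bar{x}$ is the sample mean, and $s$ is the sample standard deviation, which is what R's sd() returns (it uses an $n-1$ denominator).

```latex
z_i = \frac{x_i - \bar{x}}{s}, \qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}
```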
The magnitude of the Z-score shows how many standard deviations away from the mean the data point is. A Z-score of 0 means the data point is exactly at the mean. A positive Z-score indicates the data point is above the mean. A negative Z-score indicates the data point is below the mean.
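As a quick sanity check of this interpretation, here is a minimal sketch on a toy vector. The values in x are made up for illustration and are not part of the penguins data.

```r
# Toy data chosen so the result is easy to verify by hand (an assumption
# for illustration, not taken from the penguins dataset).
x <- c(2, 4, 6)

# mean(x) is 4 and sd(x) is 2, so each value is one standard deviation apart.
z <- (x - mean(x)) / sd(x)
z
# [1] -1  0  1
```

The value below the mean gets a negative Z-score, the value at the mean gets 0, and the value above the mean gets a positive Z-score.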
Let us load the packages needed.
library(tidyverse)
library(palmerpenguins)
theme_set(theme_bw(16))
We will use two variables from the Palmer penguins dataset to show how to compute Z-scores.
df <- penguins |>
  drop_na() |>
  select(bill_depth_mm, body_mass_g)
df |> head()
# A tibble: 6 × 2
  bill_depth_mm body_mass_g
          <dbl>       <int>
1          18.7        3750
2          17.4        3800
3          18          3250
4          19.3        3450
5          20.6        3650
6          17.8        3625
Let us use the summary() function to get a quick summary of the two numerical variables.
df |> summary()
 bill_depth_mm    body_mass_g
 Min.   :13.10   Min.   :2700
 1st Qu.:15.60   1st Qu.:3550
 Median :17.30   Median :4050
 Mean   :17.16   Mean   :4207
 3rd Qu.:18.70   3rd Qu.:4775
 Max.   :21.50   Max.   :6300
Let us manually compute Z-score for these two variables one-by-one.
df <- df |>
  mutate(
    bill_depth_zscore_m = (bill_depth_mm - mean(bill_depth_mm)) / sd(bill_depth_mm),
    body_mass_zscore_m = (body_mass_g - mean(body_mass_g)) / sd(body_mass_g)
  )
df |> head()
# A tibble: 6 × 4
  bill_depth_mm body_mass_g bill_depth_zscore_m body_mass_zscore_m
          <dbl>       <int>               <dbl>              <dbl>
1          18.7        3750               0.780             -0.568
2          17.4        3800               0.119             -0.506
3          18          3250               0.424             -1.19
4          19.3        3450               1.08              -0.940
5          20.6        3650               1.74              -0.692
6          17.8        3625               0.323             -0.723
We can look at the summary to see how different the Z-scores are from the original values of the variables.
df |> summary()
 bill_depth_mm    body_mass_g   bill_depth_zscore_m body_mass_zscore_m
 Min.   :13.10   Min.   :2700   Min.   :-2.06418    Min.   :-1.8716
 1st Qu.:15.60   1st Qu.:3550   1st Qu.:-0.79466    1st Qu.:-0.8160
 Median :17.30   Median :4050   Median : 0.06862    Median :-0.1950
 Mean   :17.16   Mean   :4207   Mean   : 0.00000    Mean   : 0.0000
 3rd Qu.:18.70   3rd Qu.:4775   3rd Qu.: 0.77956    3rd Qu.: 0.7053
 Max.   :21.50   Max.   :6300   Max.   : 2.20143    Max.   : 2.5992
We can also make a scatter plot to see the effect of the Z-score transformation on the original variable. Because the Z-score is a linear transformation of the original values, the points fall on a straight line: the ordering and relative spacing of the values are unchanged. Only the scale changes, so the range of the Z-scores is very different from the range of the original values, as we expect.
df |>
  ggplot(aes(x = bill_depth_mm, y = bill_depth_zscore_m)) +
  geom_point() +
  labs(title = "Z-score: before and after")
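The relationship shown in the plot can also be checked numerically. The sketch below uses a small made-up vector (an assumption for illustration, so that it runs without the penguins data) and confirms that a Z-scored variable has mean 0, standard deviation 1, and a correlation of 1 with the original values, up to floating-point rounding.

```r
# Hypothetical values for illustration only.
x <- c(13.1, 15.6, 17.3, 18.7, 21.5)
z <- (x - mean(x)) / sd(x)

# all.equal() compares numbers with a small tolerance, which is the
# appropriate way to test floating-point results.
mean_is_zero <- isTRUE(all.equal(mean(z), 0))
sd_is_one    <- isTRUE(all.equal(sd(z), 1))
cor_is_one   <- isTRUE(all.equal(cor(x, z), 1))

c(mean_is_zero, sd_is_one, cor_is_one)
```

The same checks could be run on df$bill_depth_mm and df$bill_depth_zscore_m from the example above.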
Computing Z-score using scale() function
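We can also compute Z-scores using the scale() function available in base R. By default, scale() subtracts the mean and divides by the standard deviation. We wrap the call in c() because scale() returns a matrix rather than a plain vector.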
df <- df |>
  mutate(
    bill_depth_zscore = c(scale(bill_depth_mm)),
    body_mass_zscore = c(scale(body_mass_g))
  )
df |> head()
# A tibble: 6 × 6
  bill_depth_mm body_mass_g bill_depth_zscore_m body_mass_zscore_m
          <dbl>       <int>               <dbl>              <dbl>
1          18.7        3750               0.780             -0.568
2          17.4        3800               0.119             -0.506
3          18          3250               0.424             -1.19
4          19.3        3450               1.08              -0.940
5          20.6        3650               1.74              -0.692
6          17.8        3625               0.323             -0.723
# ℹ 2 more variables: bill_depth_zscore <dbl>, body_mass_zscore <dbl>
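To see why the c() wrapper is used in the code above, here is a minimal sketch of what scale() returns for a single vector. The toy vector x is an assumption for illustration.

```r
# Toy data for illustration only.
x <- c(2, 4, 6)

s <- scale(x)
is.matrix(s)              # scale() returns a one-column matrix
attr(s, "scaled:center")  # the mean that was subtracted
attr(s, "scaled:scale")   # the standard deviation that was divided by

# c() drops the matrix dimensions and the extra attributes,
# leaving a plain numeric vector that fits in a data frame column.
z <- c(s)
is.matrix(z)
```

as.numeric(scale(x)) would be an equivalent way to get a plain vector.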
We can check the summaries of the Z-scores computed by the two approaches.
df |>
  select(-bill_depth_mm, -body_mass_g) |>
  summary()
 bill_depth_zscore_m body_mass_zscore_m bill_depth_zscore  body_mass_zscore
 Min.   :-2.06418    Min.   :-1.8716    Min.   :-2.06418   Min.   :-1.8716
 1st Qu.:-0.79466    1st Qu.:-0.8160    1st Qu.:-0.79466   1st Qu.:-0.8160
 Median : 0.06862    Median :-0.1950    Median : 0.06862   Median :-0.1950
 Mean   : 0.00000    Mean   : 0.0000    Mean   : 0.00000   Mean   : 0.0000
 3rd Qu.: 0.77956    3rd Qu.: 0.7053    3rd Qu.: 0.77956   3rd Qu.: 0.7053
 Max.   : 2.20143    Max.   : 2.5992    Max.   : 2.20143   Max.   : 2.5992
We can also check whether each element of the Z-scores computed by the two approaches is the same.
all.equal(df$bill_depth_zscore_m, df$bill_depth_zscore)
TRUE
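Note that all.equal() compares numbers up to a small tolerance, and it returns a character description instead of FALSE when the inputs differ. A minimal sketch of a slightly more defensive check follows; the vector x and the variable names are assumptions for illustration.

```r
# Hypothetical values for illustration only.
x <- c(18.7, 17.4, 18.0, 19.3, 20.6, 17.8)

z_manual <- (x - mean(x)) / sd(x)  # formula approach
z_scale  <- c(scale(x))            # scale() approach

# isTRUE() turns the result of all.equal() into a strict TRUE/FALSE,
# which is safe to use inside if() or stopifnot().
same <- isTRUE(all.equal(z_manual, z_scale))

# Largest element-wise difference between the two approaches.
max_diff <- max(abs(z_manual - z_scale))

same
```

Any difference between the two results should be at the level of floating-point rounding.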