In this tutorial, we will learn how to compute the Pearson correlation between all pairs of variables in a matrix or dataframe, using two approaches.
First, we will use the cor() function in R to compute the Pearson correlation of all variables against all variables. Then we will use matrix multiplication to compute the same Pearson correlation matrix.
library(tidyverse)
library(palmerpenguins)
theme_set(theme_bw(16))
We use the numerical variables from the Palmer penguins data to show how to compute Pearson correlation.
df <- penguins |> drop_na() |> select(-year) |> select(where(is.numeric))
df |> head()
## # A tibble: 6 × 4
##   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##            <dbl>         <dbl>             <int>       <int>
## 1           39.1          18.7               181        3750
## 2           39.5          17.4               186        3800
## 3           40.3          18                 195        3250
## 4           36.7          19.3               193        3450
## 5           39.3          20.6               190        3650
## 6           38.9          17.8               181        3625
A key step in computing correlation is to center and scale the data, i.e. to subtract each variable's mean and divide by its standard deviation. We can use the scale() function to center and scale each column.
df_scaled <- df |> scale()
df_scaled |> head()
##      bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## [1,]     -0.8946955     0.7795590        -1.4246077  -0.5676206
## [2,]     -0.8215515     0.1194043        -1.0678666  -0.5055254
## [3,]     -0.6752636     0.4240910        -0.4257325  -1.1885721
## [4,]     -1.3335592     1.0842457        -0.5684290  -0.9401915
## [5,]     -0.8581235     1.7444004        -0.7824736  -0.6918109
## [6,]     -0.9312674     0.3225288        -1.4246077  -0.7228585
We can check that the centered variables have mean values close to zero (not exactly zero, because of floating-point rounding).
colMeans(df_scaled)
##    bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g
##     -3.235317e-15     -7.304801e-16      2.808064e-16     -1.260253e-16
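To make the centering and scaling step concrete, here is a minimal sketch that reproduces scale() by hand and checks that each scaled column has standard deviation 1. The names X and manual are introduced only for this illustration; the data preparation repeats the pipeline used above.

```r
library(tidyverse)
library(palmerpenguins)

# Same numeric penguin variables as in the tutorial
df <- penguins |>
  drop_na() |>
  select(-year) |>
  select(where(is.numeric))

df_scaled <- df |> scale()

# Manual standardization: subtract each column mean,
# then divide by each column's sample standard deviation
X <- as.matrix(df)
manual <- sweep(X, 2, colMeans(X), "-")
manual <- sweep(manual, 2, apply(X, 2, sd), "/")

# scale() stores the centers and scales as extra attributes,
# so compare the numeric values only
all.equal(manual, df_scaled, check.attributes = FALSE)

# Standard deviation of each scaled column
apply(df_scaled, 2, sd)
```

The comparison uses all.equal() rather than identical() because the two computations may differ by tiny floating-point amounts.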
Compute Pearson Correlation of a matrix with cor()
We can use cor() in R to compute Pearson correlation of all columns/variables in a data matrix.
cor(df_scaled)
##                   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## bill_length_mm         1.0000000    -0.2286256         0.6530956   0.5894511
## bill_depth_mm         -0.2286256     1.0000000        -0.5777917  -0.4720157
## flipper_length_mm      0.6530956    -0.5777917         1.0000000   0.8729789
## body_mass_g            0.5894511    -0.4720157         0.8729789   1.0000000
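One point worth noting: cor() centers and scales internally, so it should give the same matrix whether or not the data were scaled first. The sketch below checks this on the same data; cor_raw and cor_scaled are names made up for the illustration, and method = "pearson" is spelled out even though it is the default.

```r
library(tidyverse)
library(palmerpenguins)

# Same numeric penguin variables as in the tutorial
df <- penguins |>
  drop_na() |>
  select(-year) |>
  select(where(is.numeric))

# Correlation on the raw data and on the centered/scaled data
cor_raw    <- cor(df, method = "pearson")
cor_scaled <- cor(scale(df), method = "pearson")

# Pearson correlation is unaffected by centering and scaling,
# so the two matrices should agree up to floating-point error
all.equal(cor_raw, cor_scaled, check.attributes = FALSE)
```

The explicit scaling step matters for the matrix multiplication approach, not for cor() itself.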
Compute Pearson Correlation of a matrix with matrix multiplication
We can compute the Pearson correlation of all variables against all variables using matrix multiplication: take the transpose of the centered and scaled data matrix, multiply it by the scaled data matrix itself, and divide by n - 1, where n is the number of rows.
n <- nrow(df_scaled)
(t(df_scaled) %*% df_scaled) / (n - 1)
##                   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## bill_length_mm         1.0000000    -0.2286256         0.6530956   0.5894511
## bill_depth_mm         -0.2286256     1.0000000        -0.5777917  -0.4720157
## flipper_length_mm      0.6530956    -0.5777917         1.0000000   0.8729789
## body_mass_g            0.5894511    -0.4720157         0.8729789   1.0000000
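Why this works can be sketched with a short derivation. The notation is introduced here for illustration: x_{ij} is the value of variable j in row i, \bar{x}_j and s_j are the mean and sample standard deviation (denominator n - 1, as used by sd() and scale()) of variable j, and Z is the centered and scaled matrix.

```latex
z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j},
\qquad
r_{jk}
  = \frac{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)}{(n-1)\, s_j\, s_k}
  = \frac{1}{n-1} \sum_{i=1}^{n} z_{ij}\, z_{ik}
  = \frac{1}{n-1} \left( Z^{\top} Z \right)_{jk}
```

Each entry of Z^T Z is the sum of products of two scaled columns, so dividing by n - 1 gives the Pearson correlation between them.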
Note that the two approaches give identical results for the Pearson correlation of all the variables in a dataframe/matrix.
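Rather than comparing the printed matrices by eye, the agreement can be checked programmatically. This is a minimal sketch; cor_builtin and cor_matmul are illustrative names, and crossprod(x) is used as base R's equivalent of t(x) %*% x.

```r
library(tidyverse)
library(palmerpenguins)

# Same numeric penguin variables as in the tutorial
df <- penguins |>
  drop_na() |>
  select(-year) |>
  select(where(is.numeric))

df_scaled <- df |> scale()
n <- nrow(df_scaled)

# Approach 1: built-in cor()
cor_builtin <- cor(df_scaled)

# Approach 2: matrix multiplication on the scaled data;
# crossprod(x) computes t(x) %*% x
cor_matmul <- crossprod(df_scaled) / (n - 1)

# Compare the numeric values of the two matrices, ignoring attributes
all.equal(cor_builtin, cor_matmul, check.attributes = FALSE)
```

all.equal() allows for small floating-point differences, which is the appropriate notion of "identical" when two numerical routes arrive at the same quantity.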