How to Compute Pearson Correlation of Multiple Variables

In this tutorial, we will learn how to compute the Pearson correlation between multiple variables. We will use two approaches to compute the pairwise Pearson correlations of all columns in a matrix or data frame.

First, we will use the cor() function in R to compute the Pearson correlation of every variable against every other variable. Then we will compute the same correlation matrix using matrix multiplication.

library(tidyverse)
library(palmerpenguins)
theme_set(theme_bw(16))

We use the numerical variables from the Palmer penguins data to show how to compute Pearson correlation.

df <- penguins |>
  drop_na() |>
  select(-year) |>
  select(where(is.numeric))
df |> head()

## # A tibble: 6 × 4
##   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##            <dbl>         <dbl>             <int>       <int>
## 1           39.1          18.7               181        3750
## 2           39.5          17.4               186        3800
## 3           40.3          18                 195        3250
## 4           36.7          19.3               193        3450
## 5           39.3          20.6               190        3650
## 6           38.9          17.8               181        3625

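A key step in computing correlation by hand is to center and scale the data, i.e. subtract each variable's mean and divide by its standard deviation. We can use the scale() function to center and scale each column. Note that scale() returns a matrix, not a tibble.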

df_scaled <- df |>
  scale()
df_scaled |> head()
##      bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## [1,]     -0.8946955     0.7795590        -1.4246077  -0.5676206
## [2,]     -0.8215515     0.1194043        -1.0678666  -0.5055254
## [3,]     -0.6752636     0.4240910        -0.4257325  -1.1885721
## [4,]     -1.3335592     1.0842457        -0.5684290  -0.9401915
## [5,]     -0.8581235     1.7444004        -0.7824736  -0.6918109
## [6,]     -0.9312674     0.3225288        -1.4246077  -0.7228585

We can check that the centered variables have mean values close to zero. The tiny nonzero values are floating-point rounding error.

colMeans(df_scaled)
##    bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g 
##     -3.235317e-15     -7.304801e-16      2.808064e-16     -1.260253e-16
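
As a complementary check, scale() also divides each centered column by its sample standard deviation, so every scaled column should have a standard deviation of 1. The sketch below uses a small simulated matrix `x` (a hypothetical stand-in, not the penguins data) so that it runs with base R alone, and compares scale() with a manual center-and-scale done with sweep().

```r
# Simulated stand-in data (assumption: any numeric matrix behaves the same way)
set.seed(42)
x <- matrix(rnorm(60, mean = 10, sd = 3), nrow = 20, ncol = 3,
            dimnames = list(NULL, c("a", "b", "c")))

x_scaled <- scale(x)

# Manual version: subtract column means, then divide by column standard deviations
x_centered <- sweep(x, 2, colMeans(x), "-")
x_manual <- sweep(x_centered, 2, apply(x, 2, sd), "/")

# Column standard deviations of the scaled data should all be 1
col_sds <- apply(x_scaled, 2, sd)
print(col_sds)

# scale() and the manual version should agree up to floating-point error
same_result <- isTRUE(all.equal(x_manual, x_scaled, check.attributes = FALSE))
print(same_result)
```

The comparison uses all.equal() rather than identical() because the two computations can differ in the last few bits of floating-point precision.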

Compute Pearson Correlation of a matrix with cor()

We can use cor() in R to compute Pearson correlation of all columns/variables in a data matrix.

cor(df_scaled)

##                   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## bill_length_mm         1.0000000    -0.2286256         0.6530956   0.5894511
## bill_depth_mm         -0.2286256     1.0000000        -0.5777917  -0.4720157
## flipper_length_mm      0.6530956    -0.5777917         1.0000000   0.8729789
## body_mass_g            0.5894511    -0.4720157         0.8729789   1.0000000
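
Pearson correlation does not change when a variable is shifted or multiplied by a positive constant, so cor() should return the same matrix for the raw data as for the centered and scaled data. The sketch below illustrates this with a simulated matrix `x` (a hypothetical stand-in for the penguins data, used so the example runs with base R alone).

```r
# Simulated stand-in data (assumption: not the penguins data)
set.seed(1)
x <- matrix(rnorm(80), nrow = 20, ncol = 4,
            dimnames = list(NULL, paste0("v", 1:4)))
x[, 2] <- x[, 1] + rnorm(20, sd = 0.5)  # make two columns clearly correlated

# method = "pearson" is the default; it is spelled out here for clarity
cor_raw <- cor(x, method = "pearson")
cor_scaled <- cor(scale(x), method = "pearson")

# The two matrices should agree up to floating-point error
unchanged <- isTRUE(all.equal(cor_raw, cor_scaled))
print(unchanged)
```

In other words, the scaling step is needed for the matrix multiplication approach, not for cor() itself.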

Compute Pearson Correlation of a matrix with matrix multiplication

We can compute the Pearson correlation of all variables against all variables using matrix multiplication: multiply the transpose of the centered and scaled data matrix by the matrix itself, then divide by n - 1, where n is the number of rows.

n <- nrow(df_scaled)

(t(df_scaled) %*% df_scaled) / (n - 1)

##                   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## bill_length_mm         1.0000000    -0.2286256         0.6530956   0.5894511
## bill_depth_mm         -0.2286256     1.0000000        -0.5777917  -0.4720157
## flipper_length_mm      0.6530956    -0.5777917         1.0000000   0.8729789
## body_mass_g            0.5894511    -0.4720157         0.8729789   1.0000000

Note that the two approaches give the same Pearson correlation matrix for all the variables in a data frame or matrix, up to floating-point precision.
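
The reason this works: after scaling, each entry of the product is the sum of products of two columns' z-scores, and dividing by n - 1 matches the n - 1 denominator that scale() uses for the standard deviation. The sketch below checks the agreement programmatically on a simulated matrix `x` (a hypothetical stand-in for the penguins data), using base R's crossprod(), which computes the same product as t(z) %*% z.

```r
# Simulated stand-in data (assumption: not the penguins data)
set.seed(7)
x <- matrix(rnorm(100), nrow = 25, ncol = 4,
            dimnames = list(NULL, c("w", "x", "y", "z")))

z <- scale(x)   # center and scale each column
n <- nrow(z)

# crossprod(z) computes t(z) %*% z
cor_matmul <- crossprod(z) / (n - 1)
cor_builtin <- cor(x)

# Compare numerically, ignoring attributes and tiny floating-point differences
matches <- isTRUE(all.equal(cor_matmul, cor_builtin, check.attributes = FALSE))
print(matches)

# Each variable's correlation with itself should be 1
diag_is_one <- isTRUE(all.equal(unname(diag(cor_matmul)), rep(1, ncol(x))))
print(diag_is_one)
```

Comparing with all.equal() is safer than comparing printed output by eye, since printing rounds to a fixed number of digits.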
