Computing Correlation Between Multiple Variables in a dataframe

Compute correlations between multiple variables in a dataframe

The corrr package, part of the tidymodels collection of R packages, can compute correlations between multiple variables and offers tools to further dissect and visualize them. In this tutorial, we will learn how to use the correlate() function from the corrr package to compute correlations between all numerical variables in a dataframe.

Correlations in R with corrr package

First, we load the packages needed and check the version of the corrr package.

library(tidyverse)
library(corrr)
packageVersion("corrr")

[1] '0.4.4'

To compute correlations, we will use the classic iris dataset that is built into R. Note that the iris data has one categorical variable (Species) and four numerical variables.

iris %>% head()

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
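To confirm which columns are numerical before computing correlations, a quick check with base R can help. This is a minimal sketch that uses only sapply(), class() and is.numeric() on the built-in iris data.

```r
# Class of each column: the four measurements are "numeric", Species is a "factor"
col_classes <- sapply(iris, class)
col_classes

# Names of the numeric columns, i.e. the ones a correlation can be computed on
numeric_cols <- names(iris)[sapply(iris, is.numeric)]
numeric_cols
```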

Computing correlation between all numerical variables

We can use the correlate() function in the corrr package to compute correlations between all numerical variables. Here we use the pipe operator %>% to feed the data to correlate().

iris %>%
  correlate()

Non-numeric variables removed from input: `Species`
Correlation computed with
• Method: 'pearson'
• Missing treated using: 'pairwise.complete.obs'

By default, correlate() uses the Pearson correlation. It also removes any non-numeric variables present in the dataframe before computing correlations, as the message above reports. We get the full correlation matrix as a tibble, with the diagonal set to NA and the correlation between every pair of numerical variables, as shown below.

# A tibble: 4 × 5
  term         Sepal.Length Sepal.Width Petal.Length Petal.Width
  <chr>               <dbl>       <dbl>        <dbl>       <dbl>
1 Sepal.Length       NA          -0.118        0.872       0.818
2 Sepal.Width        -0.118      NA           -0.428      -0.366
3 Petal.Length        0.872      -0.428       NA           0.963
4 Petal.Width         0.818      -0.366        0.963      NA  
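As a sanity check, the tibble returned by correlate() can be compared with base R's cor(). The sketch below assumes that dplyr's select(where(is.numeric)) is available for picking the numeric columns up front, and it uses the quiet argument of correlate() to silence the message. The object names iris_num, cor_tbl and cor_mat are only illustrative.

```r
library(dplyr)
library(corrr)

# Keep only the numeric columns explicitly, so nothing is dropped silently
iris_num <- iris %>% select(where(is.numeric))

# corrr result as a tibble; quiet = TRUE suppresses the method message
cor_tbl <- iris_num %>% correlate(quiet = TRUE)

# Base R correlation matrix with the same method and missing-value handling
cor_mat <- cor(iris_num, method = "pearson", use = "pairwise.complete.obs")

# Row 1 of cor_tbl is Sepal.Length, so this compares the same pair of variables
all.equal(cor_tbl$Sepal.Width[1], cor_mat["Sepal.Length", "Sepal.Width"])
```

The main differences are the container (a tibble with a term column instead of a matrix) and the diagonal, which cor() fills with 1 while correlate() sets it to NA by default.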

With the rearrange() function, we can reorder the rows and columns so that highly correlated variables are grouped closer together.

iris %>%
  correlate() %>%
  rearrange() 

Non-numeric variables removed from input: `Species`
Correlation computed with
• Method: 'pearson'
• Missing treated using: 'pairwise.complete.obs'

# A tibble: 4 × 5
  term         Petal.Length Petal.Width Sepal.Length Sepal.Width
  <chr>               <dbl>       <dbl>        <dbl>       <dbl>
1 Petal.Length       NA           0.963        0.872      -0.428
2 Petal.Width         0.963      NA            0.818      -0.366
3 Sepal.Length        0.872       0.818       NA          -0.118
4 Sepal.Width        -0.428      -0.366       -0.118      NA    
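The grouping done by rearrange() is based on the absolute value of the correlations by default. As a hedged sketch, assuming the absolute argument behaves as described in the corrr documentation, we can ask it to use the signed values instead. The resulting order may or may not differ from the one above for a given dataset.

```r
library(dplyr)
library(corrr)

# Reorder using signed correlations rather than their absolute values
cor_signed <- iris %>%
  select(where(is.numeric)) %>%
  correlate(quiet = TRUE) %>%
  rearrange(absolute = FALSE)

cor_signed
```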

To keep only the lower triangle of the correlation matrix, we can use the shave() function, which sets the upper triangle to NA, as shown below.

iris %>%
  correlate() %>%
  rearrange() %>%
  shave()

Non-numeric variables removed from input: `Species`
Correlation computed with
• Method: 'pearson'
• Missing treated using: 'pairwise.complete.obs'

# A tibble: 4 × 5
  term         Petal.Length Petal.Width Sepal.Length Sepal.Width
  <chr>               <dbl>       <dbl>        <dbl>       <dbl>
1 Petal.Length       NA          NA           NA              NA
2 Petal.Width         0.963      NA           NA              NA
3 Sepal.Length        0.872       0.818       NA              NA
4 Sepal.Width        -0.428      -0.366       -0.118          NA
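Once the duplicated upper triangle has been shaved off, the remaining pairs can be listed in long format and sorted by their correlation value. This is a minimal sketch that assumes corrr's stretch() function with its na.rm argument, plus dplyr's arrange(). The name cor_long is only illustrative.

```r
library(dplyr)
library(corrr)

cor_long <- iris %>%
  select(where(is.numeric)) %>%
  correlate(quiet = TRUE) %>%
  shave() %>%                # set the upper triangle to NA
  stretch(na.rm = TRUE) %>%  # one row per remaining pair: x, y, r
  arrange(desc(r))           # strongest positive correlation first

cor_long
```

With four numerical variables this gives six rows, one per unique pair of variables.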

We can also choose the correlation method with the method argument of correlate(). In the example below, we use the Spearman correlation to compute the correlation between all numerical variables.

iris %>%
  correlate(method="spearman") %>%
  rearrange() %>%
  shave()

Non-numeric variables removed from input: `Species`
Correlation computed with
• Method: 'spearman'
• Missing treated using: 'pairwise.complete.obs'

# A tibble: 4 × 5
  term         Petal.Length Petal.Width Sepal.Length Sepal.Width
  <chr>               <dbl>       <dbl>        <dbl>       <dbl>
1 Petal.Length       NA          NA           NA              NA
2 Petal.Width         0.938      NA           NA              NA
3 Sepal.Length        0.882       0.834       NA              NA
4 Sepal.Width        -0.310      -0.289       -0.167          NA
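The Spearman correlation is the Pearson correlation computed on the ranks of the values, which is why it can differ from the Pearson results above. The sketch below checks this with base R only, for one pair of variables. The names x, y, spearman_direct and spearman_by_rank are illustrative.

```r
x <- iris$Petal.Length
y <- iris$Petal.Width

# Spearman correlation computed directly
spearman_direct <- cor(x, y, method = "spearman")

# Pearson correlation of the ranks (ties receive average ranks by default)
spearman_by_rank <- cor(rank(x), rank(y), method = "pearson")

# The two approaches should agree
all.equal(spearman_direct, spearman_by_rank)
```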