The corrr package, part of the tidymodels collection of R packages, computes correlations between multiple variables and offers tools to further dissect and visualize them. In this tutorial, we will learn how to use the correlate() function from the corrr package to compute correlations between all numerical variables in a dataframe.
First, we load the packages needed and check the version of the corrr package.
library(tidyverse)
library(corrr)
packageVersion("corrr")
[1] '0.4.4'
To compute correlations, we will use the classic iris dataset that is built into R. Note that the iris data has one categorical variable and four numerical variables.
iris %>% head()
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
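To confirm which columns are numerical before computing correlations, a quick base R check can help. This is a minimal sketch; the variable name numeric_cols is only illustrative.

```r
# Flag each column of the built-in iris data as numeric or not
numeric_cols <- sapply(iris, is.numeric)
print(numeric_cols)

# Names of the numeric columns, i.e. the ones a correlation can be computed on
names(iris)[numeric_cols]
```

Species is a factor, so it is the only column flagged as non-numeric.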
Computing correlation between all numerical variables
We can use the correlate() function from the corrr package to compute correlations between all numerical variables. Here we use the pipe operator %>% to feed the data to correlate().
iris %>% correlate()
Non-numeric variables removed from input: `Species`
Correlation computed with
• Method: 'pearson'
• Missing treated using: 'pairwise.complete.obs'
By default, correlate() uses Pearson correlation. It also removes any non-numeric variables present in the dataframe before computing correlations, as the message above reports. We get the full correlation matrix as a tibble, with the diagonal set to NA, as shown below.
# A tibble: 4 × 5
  term         Sepal.Length Sepal.Width Petal.Length Petal.Width
  <chr>               <dbl>       <dbl>        <dbl>       <dbl>
1 Sepal.Length       NA          -0.118        0.872       0.818
2 Sepal.Width        -0.118      NA           -0.428      -0.366
3 Petal.Length        0.872      -0.428       NA           0.963
4 Petal.Width         0.818      -0.366        0.963      NA
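As a sanity check, the same values can be reproduced with base R's cor(). The sketch below assumes the first column of the correlate() result is named term, as in the output above (other corrr versions may name it differently), and passes quiet = TRUE to suppress the informational message. The names iris_num, base_val, and corrr_val are only illustrative.

```r
library(corrr)

# Keep only the numeric columns so both functions see the same input
iris_num <- iris[sapply(iris, is.numeric)]

# Base R: a plain numeric correlation matrix
base_cor <- cor(iris_num, use = "pairwise.complete.obs", method = "pearson")

# corrr: a tibble with one row per variable
corrr_cor <- correlate(iris_num, quiet = TRUE)

# Pull the same pair of variables out of each result and compare
base_val <- base_cor["Sepal.Length", "Petal.Length"]
corrr_val <- corrr_cor[["Petal.Length"]][corrr_cor[["term"]] == "Sepal.Length"]
all.equal(base_val, corrr_val)
```

The main practical difference is the return type: cor() gives a matrix, while correlate() gives a tibble that works directly with dplyr verbs and the pipe.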
With the rearrange() function, we can reorder the rows and columns so that highly correlated variables are grouped next to each other.
iris %>% correlate() %>% rearrange()
Non-numeric variables removed from input: `Species`
Correlation computed with
• Method: 'pearson'
• Missing treated using: 'pairwise.complete.obs'
# A tibble: 4 × 5
  term         Petal.Length Petal.Width Sepal.Length Sepal.Width
  <chr>               <dbl>       <dbl>        <dbl>       <dbl>
1 Petal.Length       NA           0.963        0.872      -0.428
2 Petal.Width         0.963      NA            0.818      -0.366
3 Sepal.Length        0.872       0.818       NA          -0.118
4 Sepal.Width        -0.428      -0.366       -0.118      NA
Because the correlation matrix is symmetric, each value appears twice. To keep only the lower triangle, we can use the shave() function, which sets the upper triangle to NA, as shown below.
iris %>% correlate() %>% rearrange() %>% shave()
Non-numeric variables removed from input: `Species`
Correlation computed with
• Method: 'pearson'
• Missing treated using: 'pairwise.complete.obs'
# A tibble: 4 × 5
  term         Petal.Length Petal.Width Sepal.Length Sepal.Width
  <chr>               <dbl>       <dbl>        <dbl>       <dbl>
1 Petal.Length       NA          NA           NA              NA
2 Petal.Width         0.963      NA           NA              NA
3 Sepal.Length        0.872       0.818       NA              NA
4 Sepal.Width        -0.428      -0.366       -0.118          NA
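shave() also takes an upper argument, which defaults to TRUE. Setting it to FALSE should blank out the lower triangle and keep the upper one instead. The sketch below skips rearrange() for brevity and uses the illustrative names iris_num, cor_df, and upper_only.

```r
library(corrr)

# Numeric columns only, so no non-numeric message is printed
iris_num <- iris[sapply(iris, is.numeric)]
cor_df <- correlate(iris_num, quiet = TRUE)

# upper = FALSE sets the lower triangle to NA and keeps the upper triangle
upper_only <- shave(cor_df, upper = FALSE)
upper_only
```

Either triangle carries the same information, so the choice mostly depends on how the result will be displayed or plotted.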
We can also choose the correlation method with the method argument of correlate(). In the example below, we compute Spearman correlation between all numerical variables.
iris %>% correlate(method="spearman") %>% rearrange() %>% shave()
Non-numeric variables removed from input: `Species`
Correlation computed with
• Method: 'spearman'
• Missing treated using: 'pairwise.complete.obs'
# A tibble: 4 × 5
  term         Petal.Length Petal.Width Sepal.Length Sepal.Width
  <chr>               <dbl>       <dbl>        <dbl>       <dbl>
1 Petal.Length       NA          NA           NA              NA
2 Petal.Width         0.938      NA           NA              NA
3 Sepal.Length        0.882       0.834       NA              NA
4 Sepal.Width        -0.310      -0.289       -0.167          NA
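Spearman correlation is the Pearson correlation of the ranked values, which explains why the numbers differ from the earlier Pearson results. The sketch below illustrates this by ranking each column with base R's rank() (ties receive the average rank) and then computing Pearson correlation on the ranks. The names iris_num, iris_ranks, spearman_df, and pearson_on_ranks are only illustrative.

```r
library(corrr)

# Numeric columns only
iris_num <- iris[sapply(iris, is.numeric)]

# Spearman correlation computed directly
spearman_df <- correlate(iris_num, method = "spearman", quiet = TRUE)

# Replace every column by its ranks, then compute Pearson correlation
iris_ranks <- as.data.frame(lapply(iris_num, rank))
pearson_on_ranks <- correlate(iris_ranks, method = "pearson", quiet = TRUE)

# The two results should agree up to floating-point tolerance
all.equal(as.data.frame(spearman_df), as.data.frame(pearson_on_ranks))
```

Because it works on ranks, Spearman correlation measures monotonic rather than strictly linear association and is less sensitive to outliers.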