In this tutorial, we will learn how to compute the correlation between two numerical variables in R using the cor() function, which ships with base R.
The correlation between two numerical variables ranges from -1 to +1. Negative values suggest that the two variables are negatively correlated, and positive values suggest that they are positively correlated. When there is no linear relationship between the two variables, the correlation value will be around zero.
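To make the two extremes concrete, here is a minimal sketch using made-up vectors. The names a, b_up and b_down are hypothetical and are not used elsewhere in this tutorial. A perfectly increasing linear relationship gives a correlation of +1, and a perfectly decreasing one gives -1.

```r
# hypothetical toy vectors, for illustration only
a <- 1:10
b_up <- 2 * a + 3     # increases linearly with a
b_down <- -2 * a + 3  # decreases linearly with a

r_up <- cor(a, b_up)      # equals 1 (up to floating-point error)
r_down <- cor(a, b_down)  # equals -1 (up to floating-point error)
r_up
r_down
```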
First, we will compute correlation between two numerical vectors. Next, we will see two examples of how to compute correlation between two numerical variables present in a dataframe.
How to compute correlation between two numerical vectors
First, let us generate two numerical variables, x and y, using random numbers from a normal distribution.
set.seed(21)
# generate x variable: random numbers from normal distribution
x <- rnorm(30)
# round the numbers to one decimal place for printing
round(x, 1)
 [1]  0.8  0.5  1.7 -1.3  2.2  0.4 -1.6 -0.9  0.1  0.0 -2.3  0.8 -0.5  0.2  0.6
[16]  1.5  0.7  1.1 -0.8 -0.4  0.4  0.0 -1.0 -1.3 -0.2  0.7  0.3 -1.1 -0.7 -0.7
Similarly, let us create a second numerical variable, y.
# generate y variable: random numbers from normal distribution
y <- rnorm(30)
# round the numbers to one decimal place for printing
round(y, 1)
 [1] -1.8 -0.4  0.0  0.9  1.6  0.1  1.8  0.1  1.4  1.5  0.1 -1.5  0.0 -0.2  2.1
[16]  0.2  0.5  1.7  0.4 -1.3 -0.6  1.8 -0.2 -0.4  0.6  1.0  0.0 -0.9  0.9  1.2
Now we can find the correlation between these two variables, x and y, using the cor() function in base R.
# compute correlation between two numerical variables
cor(x, y)
[1] 0.05046053
In this example, we computed the correlation between two randomly generated variables. The correlation is close to zero, suggesting that there is little or no linear correlation between them.
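A single pair of random vectors rarely gives a correlation of exactly zero. As a rough sketch of how such values behave, we can repeat the experiment many times and look at the spread. The seed, the number of repetitions and the name r_values are arbitrary choices made for this illustration.

```r
set.seed(1)  # arbitrary seed, chosen only for reproducibility

# correlation of two independent normal samples of size 30, repeated 1000 times
r_values <- replicate(1000, cor(rnorm(30), rnorm(30)))

mean(r_values)   # the average correlation is close to zero
range(r_values)  # individual values scatter on both sides of zero
```

With only 30 observations per sample, individual correlations can drift noticeably away from zero even though the variables are independent.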
Computing Spearman correlation in R
Correlation can be computed by different methods. By default, the cor() function uses the Pearson method. It can also compute correlation by two other methods, Spearman and Kendall.
To compute correlation by a specific method, we pass the name of the method to the method argument. For example, to compute the Spearman rank correlation between the same two variables x and y, we set method = "spearman":
cor(x, y, method = "spearman")
[1] 0.01846496
Note that the Pearson and Spearman correlation values differ.
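The two values differ because Spearman correlation works on the ranks of the observations rather than on the raw values. A minimal sketch, regenerating x and y as above so that it runs on its own, checks that Spearman correlation matches Pearson correlation computed on ranks, and also shows the call for the Kendall method. The names spearman, pearson_on_ranks and kendall are introduced here for illustration.

```r
set.seed(21)
x <- rnorm(30)
y <- rnorm(30)

# Spearman correlation as computed by cor()
spearman <- cor(x, y, method = "spearman")

# Pearson correlation applied to the ranks of x and y
pearson_on_ranks <- cor(rank(x), rank(y))

all.equal(spearman, pearson_on_ranks)  # TRUE

# Kendall rank correlation uses the same interface
kendall <- cor(x, y, method = "kendall")
kendall
```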
Computing Correlation between two numerical variables in a dataframe
We will show how to compute the correlation between two numerical variables from a dataframe in two slightly different ways. First, we will save the variables of interest from the dataframe as separate variables. Then we compute the correlation.
Let us use the Palmer Penguins dataset, available as a dataframe from the palmerpenguins package.
library(tidyverse) # provides %>%, drop_na() and pull()
library(palmerpenguins)
# remove rows with missing values
df <- penguins %>% drop_na()
df %>% head()
# A tibble: 6 × 8
  species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
  <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
1 Adelie  Torge…           39.1          18.7              181        3750 male
2 Adelie  Torge…           39.5          17.4              186        3800 fema…
3 Adelie  Torge…           40.3          18                195        3250 fema…
4 Adelie  Torge…           36.7          19.3              193        3450 fema…
5 Adelie  Torge…           39.3          20.6              190        3650 male
6 Adelie  Torge…           38.9          17.8              181        3625 fema…
# … with 1 more variable: year <int>
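Rows with missing values are dropped because, by default, cor() returns NA when either input contains an NA. The sketch below uses two small made-up vectors (a and b are hypothetical, not taken from the penguins data) to show that behaviour, along with the use argument of cor(), which is another way to restrict the computation to complete pairs.

```r
# hypothetical toy vectors with one missing value each
a <- c(1, 2, NA, 4, 5)
b <- c(2, 4, 6, NA, 10)

r_default <- cor(a, b)                         # NA, because of the missing values
r_complete <- cor(a, b, use = "complete.obs")  # uses only pairs with no NA

r_default
r_complete  # 1, since b is exactly 2 * a on the three complete pairs
```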
To compute the correlation between body mass and flipper length, we will extract those two variables from the dataframe and save them as new variables.
body_mass <- df %>% pull(body_mass_g)
flipper_length <- df %>% pull(flipper_length_mm)
Now we can compute the correlation as before using the cor() function. In this example, the two variables are highly correlated, with a Pearson correlation of about 0.87.
cor(body_mass, flipper_length)
[1] 0.8729789
Using base R notation, we can directly access a variable in a dataframe with the $ operator. In this second approach, we compute the correlation by extracting each variable from the dataframe with $, as shown below.
cor(df$body_mass_g, df$flipper_length_mm)
[1] 0.8729789
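The same idea works with any dataframe. As a sketch that runs without installing extra packages, the example below uses the built-in mtcars dataset (an assumption made for illustration; it is not part of the penguins example) and shows two other base R ways to reach the columns, [[ ]] and with(), which give the same result as $.

```r
# mtcars ships with R; mpg and wt are two of its numeric columns
r_dollar <- cor(mtcars$mpg, mtcars$wt)

# [[ ]] extracts a column by name, just like $
r_bracket <- cor(mtcars[["mpg"]], mtcars[["wt"]])

# with() evaluates the expression using the dataframe's columns
r_with <- with(mtcars, cor(mpg, wt))

identical(r_dollar, r_bracket)  # TRUE
identical(r_dollar, r_with)     # TRUE
```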