Computing Correlation with R

In this tutorial, we will learn how to compute the correlation between two numerical variables in R using the cor() function available in base R.

The correlation between two numerical variables ranges from -1 to +1. Negative values suggest that the two variables are negatively correlated, and positive values suggest that they are positively correlated. When there is no linear relationship between the two variables, the correlation value will be close to zero.
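To make the two extremes concrete, here is a minimal sketch using made-up vectors. The names a, b_pos, and b_neg are purely illustrative and are not part of the data used later in this tutorial. An exactly increasing linear relationship gives a correlation of +1, and an exactly decreasing one gives -1.

```r
# illustrative vectors, not part of the tutorial's data
a <- 1:10
b_pos <- 2 * a + 1    # increases linearly with a
b_neg <- -3 * a + 5   # decreases linearly with a

# perfect positive correlation: equals 1 (up to floating-point error)
r_pos <- cor(a, b_pos)

# perfect negative correlation: equals -1 (up to floating-point error)
r_neg <- cor(a, b_neg)
```

Real data rarely reach these extremes. Values in between indicate a weaker linear relationship.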

First, we will compute the correlation between two numerical vectors. Next, we will see two ways to compute the correlation between two numerical variables stored in a dataframe.

How to compute correlation between two numerical vectors

First, let us generate two numerical variables, x and y, using random numbers drawn from a standard normal distribution.

set.seed(21)
# generate x variable: random numbers from normal distribution
x <- rnorm(30)

# round the numbers to one decimal place for printing
round(x, 1)

 [1]  0.8  0.5  1.7 -1.3  2.2  0.4 -1.6 -0.9  0.1  0.0 -2.3  0.8 -0.5  0.2  0.6
[16]  1.5  0.7  1.1 -0.8 -0.4  0.4  0.0 -1.0 -1.3 -0.2  0.7  0.3 -1.1 -0.7 -0.7

Similarly, let us create a second numerical variable, y.

# generate y variable: random numbers from normal distribution
y <- rnorm(30)

# round the numbers to one decimal place for printing
round(y, 1)

 [1] -1.8 -0.4  0.0  0.9  1.6  0.1  1.8  0.1  1.4  1.5  0.1 -1.5  0.0 -0.2  2.1
[16]  0.2  0.5  1.7  0.4 -1.3 -0.6  1.8 -0.2 -0.4  0.6  1.0  0.0 -0.9  0.9  1.2

Now we can find the correlation between these two variables, x and y, using the cor() function in base R.

# compute correlation between two numerical variables
cor(x, y)

[1] 0.05046053

In this example, we computed the correlation between two randomly generated variables. The correlation is close to zero, suggesting that there is no linear relationship between them.
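As a rough check on what cor() returns by default, the Pearson correlation is the covariance of the two variables divided by the product of their standard deviations. The sketch below regenerates x and y the same way as above and compares a manual computation with cor(). The name r_manual is only illustrative.

```r
# regenerate the same two vectors as above
set.seed(21)
x <- rnorm(30)
y <- rnorm(30)

# Pearson correlation by hand: covariance scaled by both standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

# compare with cor(); all.equal() allows for floating-point error
all.equal(r_manual, cor(x, y))
```

The comparison should return TRUE, since cor() with its default method computes the same quantity.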

Computing Spearman correlation in R

Correlation can be computed by different methods. By default, the cor() function uses the Pearson method. It can also compute correlation by two other methods, Spearman and Kendall.

To compute correlation by a specific method, we provide the method argument with the name of the method. For example, to compute the Spearman rank correlation between the same two variables x and y, we set method="spearman":

cor(x, y, method="spearman")

[1] 0.01846496

Note that the correlation values from Pearson and Spearman differ.
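One way to see why the two values differ: the Spearman correlation is the Pearson correlation computed on the ranks of the values rather than on the values themselves. The sketch below, which regenerates x and y as above, illustrates this and also shows the Kendall method for completeness. The names r_spearman, r_ranks, and r_kendall are illustrative.

```r
# regenerate the same two vectors as above
set.seed(21)
x <- rnorm(30)
y <- rnorm(30)

# Spearman correlation as computed by cor()
r_spearman <- cor(x, y, method = "spearman")

# Pearson correlation of the ranks gives the same value
r_ranks <- cor(rank(x), rank(y))
all.equal(r_spearman, r_ranks)

# Kendall's tau is requested the same way, through the method argument
r_kendall <- cor(x, y, method = "kendall")
```

Because ranks ignore the actual distances between values, Spearman is less sensitive to outliers than Pearson and captures monotonic relationships that are not strictly linear.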

Computing Correlation between two numerical variables in a dataframe

We will show how to compute the correlation between two numerical variables from a dataframe in two slightly different ways. In the first approach, we save the variables of interest from the dataframe as separate variables and then compute the correlation.

Let us use the Palmer penguins dataset, available as a dataframe from the palmerpenguins package. We drop rows with missing values using drop_na().

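# tidyverse provides %>%, drop_na(), and pull()
library(tidyverse)
library(palmerpenguins)
# remove rows with missing values
df <- penguins %>%
  drop_na()
df %>% head()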

# A tibble: 6 × 8
  species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
  <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
1 Adelie  Torge…           39.1          18.7              181        3750 male 
2 Adelie  Torge…           39.5          17.4              186        3800 fema…
3 Adelie  Torge…           40.3          18                195        3250 fema…
4 Adelie  Torge…           36.7          19.3              193        3450 fema…
5 Adelie  Torge…           39.3          20.6              190        3650 male 
6 Adelie  Torge…           38.9          17.8              181        3625 fema…
# … with 1 more variable: year <int>

To compute the correlation between body mass and flipper length, we will extract those two variables from the dataframe and save them as new variables.

body_mass <- df %>% 
     pull(body_mass_g)
flipper_length <- df %>% 
     pull(flipper_length_mm)

Now we can compute the correlation as before using the cor() function. In this example, the two variables are highly correlated, with a Pearson correlation of about 0.87.

cor(body_mass, flipper_length)

[1] 0.8729789

Using base R notation, we can access a variable in a dataframe directly with the $ operator. In this second approach, we compute the correlation by extracting each variable from the dataframe with $, as shown below.

cor(df$body_mass_g, df$flipper_length_mm)

[1] 0.8729789
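The examples above remove rows with missing values before calling cor(). As a hedged aside, cor() also has a use argument that controls how missing values are handled. The sketch below uses a small made-up dataframe (toy, with hypothetical columns mass and flipper) rather than the penguins data, so the values are for illustration only.

```r
# toy dataframe with made-up values; NA marks a missing measurement
toy <- data.frame(
  mass    = c(3750, 3800, NA, 3450, 3650, 3625),
  flipper = c(181, 186, 195, NA, 190, 181)
)

# with the default use = "everything", any NA makes the result NA
r_default <- cor(toy$mass, toy$flipper)

# use = "complete.obs" keeps only rows where both values are present
r_complete <- cor(toy$mass, toy$flipper, use = "complete.obs")

# the same result, dropping incomplete rows by hand first
keep <- complete.cases(toy)
r_by_hand <- cor(toy$mass[keep], toy$flipper[keep])
```

Note that use = "complete.obs" only considers missing values in the two variables passed to cor(), whereas drop_na() applied to a whole dataframe removes rows with a missing value in any column, so the two approaches can give slightly different results on the same data.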