How to do inner join with dplyr in R

In this post, we will learn how to do an inner join in R with dplyr’s inner_join(). An inner join between two dataframes x and y, gives is the only rows from x for which we have a matching key in y.

dplyr’s inner_join() Image Credit: R4DS book

First, load tidyverse suit of packages and check the version of dplyr installed.

library(tidyverse)
packageversion("dplyr")
[1] '1.1.3'

We will create two toy dataframes, with two columns each, such that they have one common column, a key column to join the dataframes.

The first dataframe has columns, gene_name and their expression values generated by using sample() function.

set.seed(123)
df1 <- tibble(gene_name = paste0("g",c(1:4,11)),
              expression = rnorm(5, mean=30, sd=5))
df1

# A tibble: 5 × 2
  gene_name expression
  <chr>          <dbl>
1 g1              27.2
2 g2              28.8
3 g3              37.8
4 g4              30.4
5 g11             30.6

And the second dataframe has the same gene name column and chromosome location for those genes.

df2 <- tibble(gene_name = paste0("g",1:10),
              chromosome = paste0("chr", sample(1:22,10)))
# A tibble: 10 × 2
   gene_name chromosome
   <chr>     <chr>     
 1 g1        chr20     
 2 g2        chr14     
 3 g3        chr5      
 4 g4        chr19     
 5 g5        chr9      
 6 g6        chr3      
 7 g7        chr8      
 8 g8        chr10     
 9 g9        chr7      
10 g10       chr15     

How to do inner join in R with dplyr

Note that the two dataframes have a common column name that we will use as key to inner join. Also notice that the row values for the key columns are not exatly the same.

For example, the first dataframe has a fewer rows and one of the rows is not present in the second dataframe. Similarly, the second dataframe has more rows than the first and more values for key column that is not present in the first dataframe.

By inner joining, we will be matching the rows from both the dataframes whenever keys are equal using inner_join() function in dplyr as shown below.

df1 %>%
  inner_join(df2)

Joining with `by = join_by(gene_name)`
# A tibble: 4 × 3
  gene_name expression chromosome
  <chr>          <dbl> <chr>     
1 g1              27.2 chr20     
2 g2              28.8 chr14     
3 g3              37.8 chr5      
4 g4              30.4 chr19     

Note that we have not specified the column we want to use as key to match the dataframes in the above example. dplyr’s inner_join() function has automatically found the common column in both the dataframe. And the result of inner_join() tells us how inner_join() merged the two dataframes, “Joining with `by = join_by(gene_name)`”.

A good practice is to specify the column name we want to inner join with using the by argument to inner_join() function. This is important as dplyr’s inner_join() will use all the common columns by default, if there is any in the two dataframes.

df1 %>%
  inner_join(df2, by="gene_name")

# A tibble: 4 × 3
  gene_name expression chromosome
  <chr>          <dbl> <chr>     
1 g1              23.3 chr8      
2 g2              38.5 chr14     
3 g3              33.7 chr10     
4 g4              29.1 chr2      

Since inner join keeps only the matching rows, the order we use to join does not matter. In the example below, we inner join the second dataframe with the first dataframe and get the same results. Note that the order of resulting columns are different as the column from the second dataframe comes first.

df2 %>%
  inner_join(df1, by="gene_name")

# A tibble: 4 × 3
  gene_name chromosome expression
  <chr>     <chr>           <dbl>
1 g1        chr20            27.2
2 g2        chr14            28.8
3 g3        chr5             37.8
4 g4        chr19            30.4
# A tibble: 4 × 3

In the examples above, the tow dataframes have the same column name by which we are joining. Often you might want to inner join two dataframes with two different column names. See this post to learn how to join two dataframes with different column names.

Exit mobile version