In this tutorial, we will learn how to use dplyr’s anti_join() function to filter rows from one dataframe based on the presence or absence of matches in another dataframe. dplyr’s anti_join() function is extremely useful for cases when we want to find what row is present or missing in a dataframe when compared to another dataframe.
library(tidyverse)
dplyr’s anti_join() function takes two dataframes df1 and df2 as arguments and return all rows from df1 without a match in df2. Let us create two toy dataframes to illustrate how to use dplyr’s anti_join().
Here is our first dataframe df1, with three rows.
set.seed(123) id = sample(1:5,3, replace = FALSE) df1 <- tibble(id , packages = c("dplyr", "tidyr","tibble")) df1 # A tibble: 3 × 2 id packages <int> <chr> 1 3 dplyr 2 2 tidyr 3 5 tibble
The second dataframe also has three rows.
set.seed(123) id = sample(1:6,3, replace = FALSE) df2 <- tibble(id, packages= c("dplyr","ggplot2","tidyr")) df2 # A tibble: 3 × 2 id packages <int> <chr> 1 3 dplyr 2 6 ggplot2 3 2 tidyr
Note that not all rows are the same in the two dataframes. The third row in df1 is not present in df2. Similarly second row in df2 is not present in df1.
dplyr’s anti_join() to get rows present in one dataframe but not in another
By using anti_join() on df1 and df2, we will get a new dataframe with one row that is present in df1, but not in df2. Also note that in the simple use case of anti_join(), it performed join using both the columns.
anti_join(df1,df2) Joining with `by = join_by(id, packages)` # A tibble: 1 × 2 id packages <int> <chr> 1 5 tibble
If we reverse the order of input to anti_join() df2 first and df1 next, we will get a dataframe with rows that is present in df2, but not in df1.
anti_join(df2, df1) Joining with `by = join_by(id, packages)` # A tibble: 1 × 2 id packages <int> <chr> 1 6 ggplot2