In this tutorial, we will learn about the base R function duplicated() and how to use it to find out whether an element in a vector, or a row in a dataframe, is duplicated. duplicated() takes a vector, matrix, or dataframe as input and returns a logical (boolean) vector indicating, for each element or row, whether it is a duplicate of an earlier one.
Find Duplicate elements in a vector with duplicated()
Let us create a data vector with duplicates. Here we use the sample() function to draw 10 values from 1 to 10 with replacement.
set.seed(123)
x <- sample(10, 10, replace = TRUE)
We can see that our data vector contains multiple duplicates: the first and second elements are the same, the fifth and eighth elements are the same, and the third and tenth elements are the same.
x
## [1]  3  3 10  2  6  5  4  6  9 10
We can use the duplicated() function to identify duplicated elements. Applied to a vector, duplicated() returns a boolean vector that is TRUE wherever an element repeats a value that appeared earlier in the vector; the first occurrence of each value is FALSE.
duplicated(x)
## [1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE
Here we use the which() function to get the positions where the boolean vector is TRUE. This gives us the indices of the duplicated elements.
which(duplicated(x))
## [1]  2  8 10
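The boolean vector returned by duplicated() can also be used directly for subsetting. The following is a minimal sketch; it enters the values of x shown above by hand so that it does not depend on the random number generator, and the names dup_values and first_occurrences are only illustrative.

```r
# Values of x from the example above, entered by hand
x <- c(3, 3, 10, 2, 6, 5, 4, 6, 9, 10)

# Values found at the positions flagged as duplicates
dup_values <- x[duplicated(x)]
dup_values
## [1]  3  6 10

# Keep only the first occurrence of each value
first_occurrences <- x[!duplicated(x)]
first_occurrences
## [1]  3 10  2  6  5  4  9

# This matches the result of base R's unique()
identical(first_occurrences, unique(x))
## [1] TRUE
```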
By default, duplicated() scans from the first element, so the later copies of a value are the ones marked as duplicates. With the fromLast = TRUE argument, it scans from the last element instead, so the earlier copies are marked. Here is an example with fromLast = TRUE on the same data.
We can see that different elements are now identified as duplicates.
duplicated(x, fromLast = TRUE)
## [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
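Since the two scan directions flag different copies, combining them marks every element whose value occurs more than once. This is a small sketch of that idea, again with the values of x entered by hand; all_dups is an illustrative name.

```r
# Values of x from the example above, entered by hand
x <- c(3, 3, 10, 2, 6, 5, 4, 6, 9, 10)

# TRUE for every element whose value occurs more than once,
# including both the first and the last occurrence
all_dups <- duplicated(x) | duplicated(x, fromLast = TRUE)

which(all_dups)
## [1]  1  2  3  5  8 10
```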
Find the index of the first duplicated element in a vector
anyDuplicated() is a related base R function that is useful for finding the index of the first duplicated element. It returns the index i of the first duplicated entry x[i] if there is one, and 0 otherwise.
anyDuplicated(x)
## [1] 2
Note that anyDuplicated() stops after identifying the first duplicate. In our example, more elements are duplicated after that first one.
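Because the return value is 0 when nothing is duplicated, anyDuplicated() works well as a quick yes/no check. A minimal sketch follows; the vector y is a made-up example with no duplicates, and x is entered by hand.

```r
# Values of x from the example above, entered by hand
x <- c(3, 3, 10, 2, 6, 5, 4, 6, 9, 10)
# Hypothetical vector without any duplicates
y <- c(1, 2, 3)

anyDuplicated(x)
## [1] 2
anyDuplicated(y)
## [1] 0

# A return value greater than 0 means at least one duplicate exists
if (anyDuplicated(x) > 0) {
  message("x contains at least one duplicated element")
}
```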
Find Duplicated rows in a dataframe with duplicated()
The duplicated() function is also useful for identifying duplicated rows in a dataframe. Let us create a dataframe with duplicated rows using sample() and the tibble() function from the tidyverse.
library(tidyverse)
set.seed(123)
df <- tibble(
  a = sample(3, 10, replace = TRUE),
  b = sample(3, 10, replace = TRUE)
)
Our dataframe looks like this with two columns and a few duplicated rows.
df
## # A tibble: 10 x 2
##        a     b
##    <int> <int>
##  1     3     2
##  2     3     2
##  3     3     1
##  4     2     2
##  5     3     3
##  6     2     1
##  7     2     3
##  8     2     3
##  9     3     1
## 10     1     1
By using the duplicated() function on the dataframe, we get a boolean vector with one value per row, telling us whether that row repeats an earlier row.
duplicated(df)
## [1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE
In this example, we can see that the second, eighth, and ninth rows are duplicated, as they are TRUE in the boolean vector. The ninth row repeats the third row (a = 3, b = 1).
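The same boolean vector can be used to pull out or drop the repeated rows. The sketch below assumes a base R data.frame with the values of df entered by hand (as plain numbers rather than a tibble of integers), so it runs without the tidyverse; dup_rows and distinct_rows are illustrative names.

```r
# Same values as df above, entered by hand as a base R data frame
df <- data.frame(
  a = c(3, 3, 3, 2, 3, 2, 2, 2, 3, 1),
  b = c(2, 2, 1, 2, 3, 1, 3, 3, 1, 1)
)

# Rows that repeat an earlier row
dup_rows <- df[duplicated(df), ]
dup_rows
##   a b
## 2 3 2
## 8 2 3
## 9 3 1

# Rows left after dropping the repeats
distinct_rows <- df[!duplicated(df), ]
nrow(distinct_rows)
## [1] 7
```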
Find Duplicated rows in a matrix with duplicated()
We can also use the duplicated() function on a matrix to find the rows that are duplicated. Let us convert the dataframe we created above into a matrix using the as.matrix() function.
mat <- as.matrix(df)
Now that we have our data as a matrix, we can use duplicated() on it to identify the rows that are duplicated.
duplicated(mat)
## [1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE
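For matrices, duplicated() compares rows by default (MARGIN = 1). Its matrix method also accepts MARGIN = 2 to compare columns instead. The sketch below uses a small made-up matrix m, not the data from this tutorial, in which the third column repeats the first.

```r
# Hypothetical matrix: column "c" is a copy of column "a"
m <- cbind(a = c(1, 2, 3), b = c(4, 5, 6), c = c(1, 2, 3))

# Compare columns instead of rows; only the third column is flagged
dup_cols <- duplicated(m, MARGIN = 2)

# Keep only the columns that are not repeats (drop = FALSE keeps a matrix)
m_distinct <- m[, !dup_cols, drop = FALSE]
colnames(m_distinct)
## [1] "a" "b"
```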