In this post, we will learn how to use dplyr’s n_distinct() function to count the number of unique or distinct values in one or more vectors or columns of a dataframe.
dplyr’s n_distinct() is very useful when you are working with a dataframe and need to know how many unique or distinct values, or combinations of values, it contains. dplyr’s documentation describes it as a faster and more concise alternative to counting with the unique() function in base R.
We will see two examples of using dplyr’s n_distinct() function: first applying n_distinct() to a column and counting every value, and then using n_distinct() while ignoring missing values (NAs).
library(tidyverse)
packageVersion("dplyr")
[1] '1.1.4'
Let us create a sample dataframe with a column whose unique values we want to count. The column also contains an NA, so that we can illustrate how n_distinct() handles missing values.
# Create a data frame
df <- tibble(
  id = c(2, 4, 1, 2, 3, 4, NA),
  amount = c(250, 200, 250, 150, 300, 120, 200)
)
df
# A tibble: 7 × 2
     id amount
  <dbl>  <dbl>
1     2    250
2     4    200
3     1    250
4     2    150
5     3    300
6     4    120
7    NA    200
To count the number of unique values in a column, we can use dplyr’s n_distinct() function as shown below. We find that the column “id” has 5 unique values: 1, 2, 3, 4, and NA.
Note that by default n_distinct() does not remove NAs, so NA is counted as one of the distinct values.
n_distinct(df$id)
[1] 5
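As a quick cross-check against base R, the same count can be reproduced with length(unique(x)). This is only a minimal sketch. The vector name `id` is an illustrative copy of the column above, and it assumes dplyr is installed.

```r
library(dplyr)

# Illustrative copy of the id column from the example data frame
id <- c(2, 4, 1, 2, 3, 4, NA)

# dplyr: counts distinct values, NA included by default
n_dplyr <- n_distinct(id)

# base R: unique() also keeps NA as a value of its own
n_base <- length(unique(id))

n_dplyr
n_base
```

Both approaches treat NA as a distinct value here, so the two counts agree.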
We can also use a pipe-based tidyverse approach: pull() extracts the column as a vector, and n_distinct() then counts its unique/distinct elements, as shown below.
df |>
  pull(id) |>
  n_distinct()
[1] 5
If we want to ignore missing values while counting the number of distinct values, we need to pass na.rm=TRUE as an argument to the n_distinct() function, as shown below.
df |>
  pull(id) |>
  n_distinct(na.rm=TRUE)
[1] 4
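The same counts can also be computed inside summarise(), which keeps the results in a data frame and makes it easy to compare the default behaviour with na.rm=TRUE side by side. The sketch below rebuilds the example data frame so that it runs on its own. The result column names (n_ids, n_ids_no_na, n_pairs) are made up for illustration. The last column passes two columns to n_distinct() to count distinct id/amount combinations.

```r
library(dplyr)

# Same example data as above, rebuilt so the block is self-contained
df <- tibble(
  id = c(2, 4, 1, 2, 3, 4, NA),
  amount = c(250, 200, 250, 150, 300, 120, 200)
)

counts <- df |>
  summarise(
    n_ids = n_distinct(id),                      # NA counted as a value
    n_ids_no_na = n_distinct(id, na.rm = TRUE),  # NA ignored
    n_pairs = n_distinct(id, amount)             # distinct id/amount combinations
  )
counts
```

With this data the three counts should be 5, 4, and 7, since no id/amount pair repeats.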
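Using n_distinct() to count the number of unique rows in a dataframe
We can also use n_distinct() to compute the number of unique rows in a dataframe. Note that the data frame printed below is a slightly modified version of the one above: the amount values in rows 3 and 4 are swapped, so row 4 (id 2, amount 250) is now a duplicate of row 1.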
df
# A tibble: 7 × 2
     id amount
  <dbl>  <dbl>
1     2    250
2     4    200
3     1    150
4     2    250
5     3    300
6     4    120
7    NA    200
Counting the row that contains NA, we have 6 distinct rows in the dataframe.
df |>
  n_distinct()
[1] 6
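To reproduce the row count, the sketch below rebuilds the data frame with the duplicated row, using the values read off the printed output above. It then cross-checks the result by counting the rows returned by dplyr’s distinct(). The variable names df_dup, n_rows_nd, and n_rows_distinct are illustrative only.

```r
library(dplyr)

# Data frame matching the printed output above: row 4 repeats row 1
df_dup <- tibble(
  id = c(2, 4, 1, 2, 3, 4, NA),
  amount = c(250, 200, 150, 250, 300, 120, 200)
)

# Number of unique rows via n_distinct(), as in the post
n_rows_nd <- df_dup |> n_distinct()

# Cross-check: distinct() drops duplicate rows, nrow() counts what is left
n_rows_distinct <- df_dup |> distinct() |> nrow()

n_rows_nd
n_rows_distinct
```

The two numbers should agree. distinct() is the better choice when you need the de-duplicated rows themselves rather than just their count.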