In this tutorial, we will learn three ways to compute integer ranks in the tidyverse. The tidyverse's dplyr has three integer ranking functions inspired by SQL: row_number(), min_rank(), and dense_rank(). These functions differ in how they handle ties.
library(tidyverse)
packageVersion("dplyr")
[1] '1.1.2'
Let us jump into simple examples, following the ones in the dplyr documentation, and create a tibble with a sorted column that contains ties.
df <- tibble(x = c(10, 20, 20, 60))
print(df)
# A tibble: 4 × 1
      x
  <dbl>
1    10
2    20
3    20
4    60
row_number(): unique rank for every element
row_number() gives every input a unique rank, so that c(10, 20, 20, 30) would get ranks c(1, 2, 3, 4). Ties are broken by order of appearance: the first of the tied values gets the lower rank. It's equivalent to rank(ties.method = "first").
df %>%
  mutate(row_no = row_number(x))
# A tibble: 4 × 2
      x row_no
  <dbl>  <int>
1    10      1
2    20      2
3    20      3
4    60      4
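The equivalence with base R's rank() mentioned above can be checked directly on a plain vector. This is a minimal sketch; the variable names dplyr_ranks and base_ranks are only illustrative.

```r
library(dplyr)

x <- c(10, 20, 20, 60)

# dplyr: unique ranks, ties broken by order of appearance
dplyr_ranks <- row_number(x)

# base R: the same tie-breaking rule via ties.method = "first"
base_ranks <- rank(x, ties.method = "first")

# Compare the two results element by element
all(dplyr_ranks == base_ranks)
# [1] TRUE
```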
min_rank(): lowest rank for all tied elements
min_rank() handles ties by assigning the lowest rank to all tied elements, which leaves a gap before the next rank. In the example below, both 20s get rank 2, and 60 gets rank 4 rather than 3.
df %>%
  mutate(min_rank = min_rank(x))
# A tibble: 4 × 2
      x min_rank
  <dbl>    <int>
1    10        1
2    20        2
3    20        2
4    60        4
dense_rank(): ranking with no gaps
dense_rank() is similar to min_rank() in that it assigns the same lowest rank to all tied elements, but unlike min_rank() it does not leave any gaps: the next distinct value always gets the next consecutive rank. For example:
df %>%
  mutate(dense_rank = dense_rank(x))
# A tibble: 4 × 2
      x dense_rank
  <dbl>      <int>
1    10          1
2    20          2
3    20          2
4    60          3
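One way to see why dense_rank() leaves no gaps is to think of it as the position of each value among the sorted distinct values. The sketch below reproduces that by hand with base R's match(), assuming there are no missing values. It is only an illustration of the idea, not how dplyr implements the function, and the names distinct_sorted and manual_dense are made up for this example.

```r
library(dplyr)

x <- c(10, 20, 20, 60)

# Sorted distinct values: 10 20 60
distinct_sorted <- sort(unique(x))

# Position of each element among the distinct values: 1 2 2 3
manual_dense <- match(x, distinct_sorted)

# Matches dplyr's dense_rank() for this input
all(manual_dense == dense_rank(x))
# [1] TRUE
```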
3 ranking functions in action
The previous examples showed how the three ranking functions work and how they differ. Now let us see another example where the original column is not sorted.
Our data looks like this.
df2 <- tibble(y = c(8, 5, 4, 4, 6))
print(df2)
# A tibble: 5 × 1
      y
  <dbl>
1     8
2     5
3     4
4     4
5     6
The ranking function row_number() would give us
df2 %>%
  mutate(row_no = row_number(y))
# A tibble: 5 × 2
      y row_no
  <dbl>  <int>
1     8      5
2     5      3
3     4      1
4     4      2
5     6      4
The ranking function min_rank() would give us
df2 %>%
  mutate(min_rank = min_rank(y))
# A tibble: 5 × 2
      y min_rank
  <dbl>    <int>
1     8        5
2     5        3
3     4        1
4     4        1
5     6        4
The ranking function dense_rank() would give us
df2 %>%
  mutate(dense_rank = dense_rank(y))
# A tibble: 5 × 2
      y dense_rank
  <dbl>      <int>
1     8          4
2     5          2
3     4          1
4     4          1
5     6          3
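To compare the three functions at a glance, it can help to compute all of them in a single mutate() call. This is a small sketch on the same unsorted data. The column names row_no, min_rk, and dense_rk and the variable ranked are chosen here for illustration, and the short names avoid clashing with the function names.

```r
library(dplyr)

df2 <- tibble(y = c(8, 5, 4, 4, 6))

# Compute all three integer ranks side by side
ranked <- df2 %>%
  mutate(
    row_no   = row_number(y),  # 5 3 1 2 4: ties broken by position
    min_rk   = min_rank(y),    # 5 3 1 1 4: ties share the lowest rank, gap after
    dense_rk = dense_rank(y)   # 4 2 1 1 3: ties share a rank, no gaps
  )

ranked
```

The two tied 4s are the only rows where row_no and min_rk disagree, while dense_rk differs from min_rk for every value above the tie because it does not skip rank 2.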