In this tutorial, we will learn how to select top and bottom rows of a dataframe based on the values of a column with tidyverse in R. Here we will see a specific approach to get top and bottom rows when the values of the column of interest has both positive and negative values. This would not work if the values are not both positive and negative.
Let us load the packages needed.
library(tidyverse) theme_set(theme_bw(16)
We will create a sample dataframe with a column containing both positive and negative values.
set.seed(2024) df <- tibble(symbol=sample(letters,10), value = rnorm(10, mean = 5,sd=10)) df # A tibble: 10 × 2 symbol value <chr> <dbl> 1 b -0.376 2 e -7.16 3 m 15.4 4 q -0.467 5 x 14.0 6 i -15.3 7 z -0.525 8 k 24.3 9 p 9.70 10 n 23.3
Our goal is to select the rows of the dataframe containing top n rows with positive values and bottom b nrows with negative values.
For example, if we are manually pick the both the top 3 and bottom 3 rows, that would give us the first 3 and last 3 rows after sorting as shown below.
df |> arrange(value) # A tibble: 10 × 2 symbol value <chr> <dbl> 1 i -15.3 2 e -7.16 3 z -0.525 4 q -0.467 5 b -0.376 6 p 9.70 7 x 14.0 8 m 15.4 9 n 23.3 10 k 24.3
Let us see how we can use tidyverse to select both the top and bottom rows. First, let create another column specifying if the values are positive or negative. This helps us creating a grouping variable to use to select top and bottom rows.
df |> mutate(direction = ifelse(value>0,"+ve","-ve")) # A tibble: 10 × 3 symbol value direction <chr> <dbl> <chr> 1 b -0.376 -ve 2 e -7.16 -ve 3 m 15.4 +ve 4 q -0.467 -ve 5 x 14.0 +ve 6 i -15.3 -ve 7 z -0.525 -ve 8 k 24.3 +ve 9 p 9.70 +ve 10 n 23.3 +ve
Select Top and bottom rows simultaenously with slice_max() and rank()
We can group by the variable and then select the top and bottom rows by rank/magnitude using slice_max() with rank() function as shown below. Note the “-” sign before the rank function.
df |> mutate(direction = ifelse(value>0,"+ve","-ve")) |> group_by(direction) |> slice_max(n=3,-rank(value)) # A tibble: 6 × 3 # Groups: direction [2] symbol value direction <chr> <dbl> <chr> 1 p 9.70 +ve 2 x 14.0 +ve 3 m 15.4 +ve 4 i -15.3 -ve 5 e -7.16 -ve 6 z -0.525 -ve
Without the “-” sign before the rank() function we will get incorrect results like shown below.
df |> mutate(direction = ifelse(value>0,"+ve","-ve")) |> group_by(direction) |> slice_max(n=3, rank(value)) # A tibble: 6 × 3 # Groups: direction [2] symbol value direction <chr> <dbl> <chr> 1 k 24.3 +ve 2 n 23.3 +ve 3 m 15.4 +ve 4 b -0.376 -ve 5 q -0.467 -ve 6 z -0.525 -ve