In this tutorial, we will learn how to sort a dataframe by one or more columns using dplyr’s arrange() function. arrange() is one of the core dplyr functions, and it lets you sort rows. By sorting, we mean that arrange() changes the order of the rows based on the values of one or more columns, without changing their content. It only affects the rows and leaves the columns unchanged.
The basic syntax of arrange() is simple: it takes a data frame and one or more column names to order by. When we use arrange() with multiple columns, each additional column is used to break ties in the values of the preceding columns.
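To make the tie-breaking rule concrete, here is a minimal sketch on a tiny made-up tibble. The data and the names scores, team and points are hypothetical and used only for illustration.

```r
library(dplyr)

# Hypothetical toy data, made up for illustration only
scores <- tibble(
  team   = c("b", "a", "b", "a"),
  points = c(10, 30, 5, 20)
)

# Sort by team first; points only decides the order of rows within the same team
sorted <- scores %>% arrange(team, points)
sorted
```

Both rows of team "a" come before both rows of team "b", and within each team the rows are ordered by points.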
Let us get started by loading tidyverse.
library(tidyverse)
We will use the fuel economy data available as mpg in the ggplot2 package, which is part of tidyverse.
mpg %>% head()
## # A tibble: 6 × 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…
dplyr arrange(): Sort by a column
To arrange the rows of a dataframe by the values of one column, we provide the column name as an argument to arrange(), in addition to the dataframe. In the example below, we use the pipe operator %>% to feed the data to arrange() and sort by the column cty.
The column cty contains the city mileage of each car, in miles per gallon, so we are sorting the rows by city mileage.
dplyr’s arrange() function rearranges the rows in ascending order of mileage value.
mpg %>% arrange(cty) %>% head()
## # A tibble: 6 × 11
##   manufacturer model       displ  year   cyl trans drv     cty   hwy fl    class
##   <chr>        <chr>       <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 dodge        dakota pic…   4.7  2008     8 auto… 4         9    12 e     pick…
## 2 dodge        durango 4wd   4.7  2008     8 auto… 4         9    12 e     suv
## 3 dodge        ram 1500 p…   4.7  2008     8 auto… 4         9    12 e     pick…
## 4 dodge        ram 1500 p…   4.7  2008     8 manu… 4         9    12 e     pick…
## 5 jeep         grand cher…   4.7  2008     8 auto… 4         9    12 e     suv
## 6 chevrolet    c1500 subu…   5.3  2008     8 auto… r        11    15 e     suv
Since the rows are in ascending order, the cars with the largest city mileage are in the last rows of the dataframe.
mpg %>% arrange(cty) %>% tail()
## # A tibble: 6 × 11
##   manufacturer model      displ  year   cyl trans  drv     cty   hwy fl    class
##   <chr>        <chr>      <dbl> <int> <int> <chr>  <chr> <int> <int> <chr> <chr>
## 1 toyota       corolla      1.8  2008     4 auto(… f        26    35 r     comp…
## 2 honda        civic        1.6  1999     4 manua… f        28    33 r     subc…
## 3 toyota       corolla      1.8  2008     4 manua… f        28    37 r     comp…
## 4 volkswagen   new beetle   1.9  1999     4 auto(… f        29    41 d     subc…
## 5 volkswagen   jetta        1.9  1999     4 manua… f        33    44 d     comp…
## 6 volkswagen   new beetle   1.9  1999     4 manua… f        35    44 d     subc…
dplyr arrange(): Sort by a column in descending order
As we saw, by default dplyr’s arrange() reorders rows by a column in ascending order. We can use the desc() function to reorder by a column in descending order.
For example, the code below sorts by the cty column in descending order, so it shows the cars with the highest city mileage first.
mpg %>% arrange(desc(cty)) %>% head()
## # A tibble: 6 × 11
##   manufacturer model      displ  year   cyl trans  drv     cty   hwy fl    class
##   <chr>        <chr>      <dbl> <int> <int> <chr>  <chr> <int> <int> <chr> <chr>
## 1 volkswagen   new beetle   1.9  1999     4 manua… f        35    44 d     subc…
## 2 volkswagen   jetta        1.9  1999     4 manua… f        33    44 d     comp…
## 3 volkswagen   new beetle   1.9  1999     4 auto(… f        29    41 d     subc…
## 4 honda        civic        1.6  1999     4 manua… f        28    33 r     subc…
## 5 toyota       corolla      1.8  2008     4 manua… f        28    37 r     comp…
## 6 honda        civic        1.8  2008     4 manua… f        26    34 r     subc…
dplyr arrange(): Sort by multiple columns
To sort by multiple columns, we specify the column names as arguments to dplyr’s arrange() function. In the example below, we use arrange() to sort rows by the values of two columns, cyl and cty: the number of cylinders in the car and the city mileage.
mpg %>% arrange(cyl, cty)
## # A tibble: 234 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 toyota       4runner 4…   2.7  1999     4 manu… 4        15    20 r     suv
##  2 toyota       toyota ta…   2.7  1999     4 manu… 4        15    20 r     pick…
##  3 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
##  4 toyota       4runner 4…   2.7  1999     4 auto… 4        16    20 r     suv
##  5 toyota       toyota ta…   2.7  1999     4 auto… 4        16    20 r     pick…
##  6 toyota       toyota ta…   2.7  2008     4 manu… 4        17    22 r     pick…
##  7 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 dodge        caravan 2…   2.4  1999     4 auto… f        18    24 r     mini…
## 10 hyundai      sonata       2.4  1999     4 auto… f        18    26 r     mids…
## # … with 224 more rows
Here is an image visualizing the result of sorting a dataframe by two columns with dplyr’s arrange() (thanks to TidyDataTutor.com). It first highlights the two columns we are sorting by and then shows how the rows are reordered after applying arrange(). Notice the values of the first column after sorting, and then the values of the second column: rows are ordered by the first column, and the second column orders the rows that share the same value in the first.
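Sort directions can also be mixed across columns by wrapping only some of them in desc(). The snippet below is a small sketch of that idea on the same mpg data. The variable name mixed and the select() call that trims the printed columns are just illustrative choices.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data

# Ascending by number of cylinders, then descending by city mileage
# among cars with the same number of cylinders
mixed <- mpg %>% arrange(cyl, desc(cty))

# Show only a few columns to keep the printout narrow
mixed %>% select(manufacturer, model, cyl, cty) %>% head()
```

The first rows are the four-cylinder cars with the highest city mileage.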
An important feature of dplyr’s arrange() to note is how it handles grouped data. In the words of its documentation:
Unlike other dplyr verbs, arrange() largely ignores grouping; you need to explicitly mention grouping variables (or use .by_group = TRUE) in order to group by them, and functions of variables are evaluated once per data frame, not once per group.
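A short sketch of what this means in practice, grouping mpg by the drv column. The variable names by_drv, ignored and grouped are hypothetical and chosen only for this example.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data

by_drv <- mpg %>% group_by(drv)

# The grouping is ignored: rows are sorted by cty across the whole data frame
ignored <- by_drv %>% arrange(desc(cty))

# With .by_group = TRUE, rows are sorted by drv first,
# and then by descending cty within each drv value
grouped <- by_drv %>% arrange(desc(cty), .by_group = TRUE)

grouped %>% select(manufacturer, model, drv, cty) %>% head()
```

The second call gives the same row order as spelling out the grouping variable yourself with arrange(drv, desc(cty)).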