With dplyr’s filter() function, you can subset the rows of a dataframe. In this post, we will learn how to select a subset of rows based on the values of one or more columns in a dataframe using dplyr’s filter() function.
First, we load tidyverse, the suite of R packages.
library(tidyverse)
We will use the world population dataset that ships with the tidyr package, which is loaded as part of tidyverse.
The data holds population counts over the years in three columns: country, year, and population.
population %>% head()
## # A tibble: 6 × 3
##   country      year population
##   <chr>       <int>      <int>
## 1 Afghanistan  1995   17586073
## 2 Afghanistan  1996   18415307
## 3 Afghanistan  1997   19021226
## 4 Afghanistan  1998   19496836
## 5 Afghanistan  1999   19987071
## 6 Afghanistan  2000   20595360
dplyr filter() Example 1: Select based on a value of a column
Let us filter or subset the dataframe based on the value of one column. In this example, we select the rows whose country value equals “Germany”. This filter() operation gives us a smaller dataframe containing only Germany’s population data.
population %>% filter(country == "Germany")
## # A tibble: 19 × 3
##    country  year population
##    <chr>   <int>      <int>
##  1 Germany  1995   83147770
##  2 Germany  1996   83388930
##  3 Germany  1997   83490697
##  4 Germany  1998   83500716
##  5 Germany  1999   83490881
##  6 Germany  2000   83512459
##  7 Germany  2001   83583461
##  8 Germany  2002   83685160
##  9 Germany  2003   83788480
## 10 Germany  2004   83848844
## 11 Germany  2005   83835978
## 12 Germany  2006   83740302
## 13 Germany  2007   83578794
## 14 Germany  2008   83379538
## 15 Germany  2009   83182774
## 16 Germany  2010   83017404
## 17 Germany  2011   82892904
## 18 Germany  2012   82800121
## 19 Germany  2013   82726626
Although this example used the equality operator, “==”, to make the comparison, the other common comparison operators work just as readily with dplyr’s filter() function:
- != inequality sign
- < less than sign
- > greater than sign
- <= less than or equal to sign
- >= greater than or equal to sign
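The operators listed above can be tried on the same population data. The following is a minimal sketch; the object names (not_germany, recent, early) and the cutoff years are chosen only for illustration.

```r
library(tidyverse)

# != keeps every row whose country is not Germany
not_germany <- population %>% filter(country != "Germany")

# >= keeps rows from 2012 onwards, for all countries
recent <- population %>% filter(year >= 2012)

# < keeps rows strictly before 1997, for all countries
early <- population %>% filter(year < 1997)

# Inspect the range of years kept by each filter
range(recent$year)
range(early$year)
```

Each call returns a new dataframe and leaves population itself unchanged.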
dplyr filter() Example 2: Select based on values of two columns
In the second example illustrating the filter() function, we show how to select or filter rows based on the values of more than one column. We will also use two types of comparison: an equality sign and a greater-than sign.
Here we select rows where country equals Germany and year is greater than 2010. We combine the two conditions with the & symbol because we want both to be satisfied.
population %>% filter(country == "Germany" & year > 2010)
## # A tibble: 3 × 3
##   country  year population
##   <chr>   <int>      <int>
## 1 Germany  2011   82892904
## 2 Germany  2012   82800121
## 3 Germany  2013   82726626
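As a side note to the & example, filter() also accepts several conditions separated by commas and treats them as "and", while the | operator keeps rows that satisfy either condition. The sketch below compares the variants; the object names are illustrative only.

```r
library(tidyverse)

# Two ways to require both conditions
both_amp   <- population %>% filter(country == "Germany" & year > 2010)
both_comma <- population %>% filter(country == "Germany", year > 2010)

# Check that the two forms select the same rows
all.equal(both_amp, both_comma)

# | keeps rows where at least one condition holds:
# every Germany row, plus every country's rows after 2010
either <- population %>% filter(country == "Germany" | year > 2010)
nrow(either)
```

The comma form is often easier to read when there are many conditions, while & and | are needed once "and" and "or" are mixed in a single expression.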
dplyr filter() Example 3: Select based on multiple values of a single column
In the third example, we show how to select or filter rows of a dataframe for multiple values of a single column. In this example, we use the %in% operator instead of the equality sign to select two countries.
population %>% filter(country %in% c("Germany", "Australia"))
## # A tibble: 38 × 3
##    country    year population
##    <chr>     <int>      <int>
##  1 Australia  1995   18124234
##  2 Australia  1996   18339037
##  3 Australia  1997   18563442
##  4 Australia  1998   18794552
##  5 Australia  1999   19027438
##  6 Australia  2000   19259377
##  7 Australia  2001   19487257
##  8 Australia  2002   19714625
##  9 Australia  2003   19953121
## 10 Australia  2004   20218481
## # … with 28 more rows
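The %in% test can also be combined with other conditions or negated with !. This is a small sketch along those lines; the vector name countries and the result names are made up for this example.

```r
library(tidyverse)

# Hypothetical helper vector holding the countries of interest
countries <- c("Germany", "Australia")

# Both countries, restricted to a single year
two_2013 <- population %>% filter(country %in% countries, year == 2013)
two_2013

# Negate the match with ! to keep every other country
others <- population %>% filter(!country %in% countries)
nrow(others)
```

Storing the values in a vector first keeps the filter() call short and makes the list easy to reuse.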