In this tutorial, we will learn how to select or filter rows of a dataframe with partially matching string. dplyr’s filter() function selects/filters rows based on values of one or more columns when it completely matches. However, to filter or select rows with partially matching strings in a column, we can use filter with additional functions in R. In this post, we will learn how to use two approaches, one using grepl() function from base R and str_detect() function from stringr package to select rows that partially match.
To get started let us load tidyverse, the suite of R packages.
library(tidyverse)
We will use world population data built-in with tidyr package in tidyverse to learn how can we use grepl() and str_detect() functions to select partially matching rows.
population %>% head() ## # A tibble: 6 × 3 ## country year population ## <chr> <int> <int> ## 1 Afghanistan 1995 17586073 ## 2 Afghanistan 1996 18415307 ## 3 Afghanistan 1997 19021226 ## 4 Afghanistan 1998 19496836 ## 5 Afghanistan 1999 19987071 ## 6 Afghanistan 2000 20595360
Filtering rows with partial match using grepl()
grepl() function from base R is a close relative grep() function and it takes a pattern and a vector or text and returns a boolean vector with True if the pattern matches or False if it does not. By default, grepl() does not ignore case, but with ignore.case=TRUE we can make grepl() to ignore the case while matching.
grepl(pattern, x, ignore.case = FALSE)
To filter rows with partial match we will use filter() function as before, but this time with grepl() as argument. In the example below, we are looking for matching pattern “Germ” as a pattern and country as vector to look for match. Here, grepl() will return True when country column partially match for “Germ” and possibly others as well.
population %>% filter(grepl("Germ",country)) %>% head() ## # A tibble: 6 × 3 ## country year population ## <chr> <int> <int> ## 1 Germany 1995 83147770 ## 2 Germany 1996 83388930 ## 3 Germany 1997 83490697 ## 4 Germany 1998 83500716 ## 5 Germany 1999 83490881 ## 6 Germany 2000 83512459
With grepl() we can also use regular expression to describe pattern. For example, to select countries that end with “any” we use
population %>% filter(grepl("any$", country)) ## # A tibble: 19 × 3 ## country year population ## <chr> <int> <int> ## 1 Germany 1995 83147770 ## 2 Germany 1996 83388930 ## 3 Germany 1997 83490697 ## 4 Germany 1998 83500716 ## 5 Germany 1999 83490881 ## 6 Germany 2000 83512459
Here is another example of using simple regex, but this time getting countries that start with “Ger”
population %>% filter(grepl("^Ger", country)) ## # A tibble: 19 × 3 ## country year population ## <chr> <int> <int> ## 1 Germany 1995 83147770 ## 2 Germany 1996 83388930 ## 3 Germany 1997 83490697 ## 4 Germany 1998 83500716 ## 5 Germany 1999 83490881 ## 6 Germany 2000 83512459 ... ...
Filtering rows with partial match using str_detect()
Another equivalent function available for filtering rows with partial match is str_detect() function in stringr package. As the name suggests, str_detect() “detects the presence or absence of a pattern in a string”. It is equivalent to grepl(). Note that in contrast to grepl(), the variable name is the first argument and then the pattern of interest while using str_detect().
population %>% filter(str_detect(country,"Ger")) ## # A tibble: 19 × 3 ## country year population ## <chr> <int> <int> ## 1 Germany 1995 83147770 ## 2 Germany 1996 83388930 ## 3 Germany 1997 83490697 ## 4 Germany 1998 83500716 ## 5 Germany 1999 83490881 ## 6 Germany 2000 83512459 .... ....
Like grepl() function, we can use regexp to filter rows using str_detect() function. In the example below, We are selecting rows based on values starting with a prefix.
population %>% filter(str_detect(country, "^Ger")) ## # A tibble: 19 × 3 ## country year population ## <chr> <int> <int> ## 1 Germany 1995 83147770 ## 2 Germany 1996 83388930 ## 3 Germany 1997 83490697 ## 4 Germany 1998 83500716 ## 5 Germany 1999 83490881 ## 6 Germany 2000 83512459 ... ...