dplyr filter(): How to select rows with partially matching string

In this tutorial, we will learn how to select or filter rows of a dataframe whose values partially match a string. dplyr’s filter() function is typically used to select rows where the values in one or more columns match a condition exactly. To select rows where a column only partially matches a string, we can combine filter() with string-matching functions in R. In this post, we will look at two approaches: grepl() from base R and str_detect() from the stringr package.

To get started, let us load the tidyverse suite of R packages.

library(tidyverse)

We will use the world population data that comes built in with the tidyr package, part of the tidyverse, to learn how to use grepl() and str_detect() to select partially matching rows.

population %>% head()
## # A tibble: 6 × 3
##   country      year population
##   <chr>       <int>      <int>
## 1 Afghanistan  1995   17586073
## 2 Afghanistan  1996   18415307
## 3 Afghanistan  1997   19021226
## 4 Afghanistan  1998   19496836
## 5 Afghanistan  1999   19987071
## 6 Afghanistan  2000   20595360

Filtering rows with partial match using grepl()

The grepl() function from base R is a close relative of grep(). It takes a pattern and a character vector and returns a logical vector, with TRUE where the pattern matches and FALSE where it does not. By default, grepl() is case-sensitive, but with ignore.case = TRUE we can make it ignore case while matching.


grepl(pattern, 
      x,
      ignore.case = FALSE)
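
Before using grepl() inside filter(), it can help to see what it returns on a plain character vector. The sketch below uses only base R; the countries vector is a made-up example and is not part of the population data.

```r
# Hypothetical vector of country names, used only for illustration
countries <- c("Germany", "Algeria", "germany", "France")

# Default matching is case-sensitive: only "Germany" contains "Germ"
matches_default <- grepl("Germ", countries)
print(matches_default)

# With ignore.case = TRUE, the lowercase "germany" matches as well
matches_nocase <- grepl("germ", countries, ignore.case = TRUE)
print(matches_nocase)
```

The result has one TRUE or FALSE per element of the input, which is exactly the kind of condition filter() expects for each row.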

To filter rows with a partial match, we use filter() as before, but this time with grepl() as its argument. In the example below, “Germ” is the pattern and the country column is the vector to search. grepl() returns TRUE for every row where country contains “Germ”, so filter() keeps Germany and any other country whose name contains that string.

population %>% 
  filter(grepl("Germ",country)) %>%
  head()

## # A tibble: 6 × 3
##   country  year population
##   <chr>   <int>      <int>
## 1 Germany  1995   83147770
## 2 Germany  1996   83388930
## 3 Germany  1997   83490697
## 4 Germany  1998   83500716
## 5 Germany  1999   83490881
## 6 Germany  2000   83512459

With grepl() we can also use a regular expression to describe the pattern. For example, to select countries whose names end with “any”, we use

population %>% 
  filter(grepl("any$", country))

## # A tibble: 19 × 3
##    country  year population
##    <chr>   <int>      <int>
##  1 Germany  1995   83147770
##  2 Germany  1996   83388930
##  3 Germany  1997   83490697
##  4 Germany  1998   83500716
##  5 Germany  1999   83490881
##  6 Germany  2000   83512459
...
...

Here is another example using a simple regular expression, this time selecting countries whose names start with “Ger”.

population %>% 
  filter(grepl("^Ger", country))

## # A tibble: 19 × 3
##    country  year population
##    <chr>   <int>      <int>
##  1 Germany  1995   83147770
##  2 Germany  1996   83388930
##  3 Germany  1997   83490697
##  4 Germany  1998   83500716
##  5 Germany  1999   83490881
##  6 Germany  2000   83512459
...
...
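
To see why the ^ anchor matters, here is a small sketch comparing an unanchored and an anchored pattern. It assumes dplyr is installed; the toy data frame is hypothetical and is not taken from the population data.

```r
library(dplyr)

# Hypothetical toy data frame, used only for illustration
toy <- data.frame(
  country = c("Germany", "Niger", "Algeria", "France"),
  stringsAsFactors = FALSE
)

# Unanchored, case-insensitive: matches "ger" anywhere in the name
anywhere <- toy %>%
  filter(grepl("ger", country, ignore.case = TRUE))

# Anchored with ^: matches only names that begin with "ger"
at_start <- toy %>%
  filter(grepl("^ger", country, ignore.case = TRUE))

print(anywhere$country)
print(at_start$country)
```

The unanchored pattern also keeps Niger and Algeria because both contain “ger”, while the anchored pattern keeps only Germany.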

Filtering rows with partial match using str_detect()

Another function for filtering rows with a partial match is str_detect() from the stringr package. As the name suggests, str_detect() “detects the presence or absence of a pattern in a string”. It plays the same role as grepl(). Note that, in contrast to grepl(), str_detect() takes the character vector as its first argument and the pattern as its second.

population %>% 
  filter(str_detect(country,"Ger"))

## # A tibble: 19 × 3
##    country  year population
##    <chr>   <int>      <int>
##  1 Germany  1995   83147770
##  2 Germany  1996   83388930
##  3 Germany  1997   83490697
##  4 Germany  1998   83500716
##  5 Germany  1999   83490881
##  6 Germany  2000   83512459
....
....
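
Besides argument order, the two functions also differ in how they treat missing values. The sketch below, which assumes stringr is installed and uses a made-up vector x, shows both differences side by side.

```r
library(stringr)

# Hypothetical vector with a missing value, used only for illustration
x <- c("Germany", NA, "France")

# grepl(): pattern first, then the vector; NA is reported as FALSE
base_result <- grepl("Ger", x)
print(base_result)

# str_detect(): vector first, then the pattern; NA stays NA
stringr_result <- str_detect(x, "Ger")
print(stringr_result)
```

Inside filter() the difference rarely shows, because filter() drops rows where the condition is NA as well as rows where it is FALSE.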

Like grepl(), str_detect() accepts regular expressions. In the example below, we select rows whose country values start with a given prefix.

population %>% 
  filter(str_detect(country, "^Ger"))

## # A tibble: 19 × 3
##    country  year population
##    <chr>   <int>      <int>
##  1 Germany  1995   83147770
##  2 Germany  1996   83388930
##  3 Germany  1997   83490697
##  4 Germany  1998   83500716
##  5 Germany  1999   83490881
##  6 Germany  2000   83512459
...
...
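
str_detect() has no ignore.case argument; one way to get case-insensitive matching is to wrap the pattern in stringr's regex() helper. The sketch below also shows the negate argument, which assumes stringr 1.4.0 or later. The toy data frame is hypothetical.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data frame, used only for illustration
toy <- data.frame(
  country = c("Germany", "germany", "Niger", "France"),
  stringsAsFactors = FALSE
)

# Case-insensitive prefix match via regex(ignore_case = TRUE)
starts_with_ger <- toy %>%
  filter(str_detect(country, regex("^ger", ignore_case = TRUE)))

# negate = TRUE keeps the rows that do NOT match the pattern
not_ger <- toy %>%
  filter(str_detect(country, "^Ger", negate = TRUE))

print(starts_with_ger$country)
print(not_ger$country)
```

Note that the negated pattern is still case-sensitive, so the lowercase “germany” is kept in not_ger.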