How to select rows with multiple partial matching strings

In this tutorial, we will learn how to select or filter rows of a dataframe with multiple partially matching strings. dplyr’s filter() function selects/filters rows based on values of one or more columns when it completely matches. However, if you want to select rows with partially matching strings in a column, we use filter() function in combination with str_detect() in R.

In this post, we will learn how to use filter() and str_detect() functions to detect multiple partial matrching string.

library(tidyverse)

We will use world population data built-in with tidyr package in tidyverse to learn how can we use grepl() and str_detect() functions to select partially matching rows.

population %>% head()
## # A tibble: 6 × 3
##   country      year population
##   <chr>       <int>      <int>
## 1 Afghanistan  1995   17586073
## 2 Afghanistan  1996   18415307
## 3 Afghanistan  1997   19021226
## 4 Afghanistan  1998   19496836
## 5 Afghanistan  1999   19987071
## 6 Afghanistan  2000   20595360

Filtering rows with partial match using str_detect()

Heee is an example of using stringr’s str_detect() function to detect “the presence or absence of a pattern in a string”. Note that while using str_detect() the variable or column name is the first argument and then the pattern of interest.

population %>% 
  filter(str_detect(country,"Ger"))

## # A tibble: 19 × 3
##    country  year population
##    <chr>   <int>      <int>
##  1 Germany  1995   83147770
##  2 Germany  1996   83388930
##  3 Germany  1997   83490697
##  4 Germany  1998   83500716
##  5 Germany  1999   83490881
##  6 Germany  2000   83512459
....
....

If we want to filter using multiple patterns, we can use str_detect() function in the following way. In this example using str_detect() function below, we filter rows if either “Ger” or “Aus” matches in country column. And this has given us rows corresponding the countries Australia and Germany.

population %>% 
  filter(str_detect(country,"Ger|Aus"))

## # A tibble: 57 × 3
##    country    year population
##    <chr>     <int>      <int>
##  1 Australia  1995   18124234
##  2 Australia  1996   18339037
##  3 Australia  1997   18563442
##  4 Australia  1998   18794552
##  5 Australia  1999   19027438
##  6 Australia  2000   19259377
##  7 Australia  2001   19487257
##  8 Australia  2002   19714625
##  9 Australia  2003   19953121
## 10 Australia  2004   20218481
## # … with 47 more rows
Exit mobile version