In this tutorial, we will learn how to select or filter rows of a dataframe with multiple partially matching strings. dplyr’s filter() function selects/filters rows based on values of one or more columns when it completely matches. However, if you want to select rows with partially matching strings in a column, we use filter() function in combination with str_detect() in R.
In this post, we will learn how to use filter() and str_detect() functions to detect multiple partial matrching string.
library(tidyverse)
We will use world population data built-in with tidyr package in tidyverse to learn how can we use grepl() and str_detect() functions to select partially matching rows.
population %>% head() ## # A tibble: 6 × 3 ## country year population ## <chr> <int> <int> ## 1 Afghanistan 1995 17586073 ## 2 Afghanistan 1996 18415307 ## 3 Afghanistan 1997 19021226 ## 4 Afghanistan 1998 19496836 ## 5 Afghanistan 1999 19987071 ## 6 Afghanistan 2000 20595360
Filtering rows with partial match using str_detect()
Heee is an example of using stringr’s str_detect() function to detect “the presence or absence of a pattern in a string”. Note that while using str_detect() the variable or column name is the first argument and then the pattern of interest.
population %>% filter(str_detect(country,"Ger")) ## # A tibble: 19 × 3 ## country year population ## <chr> <int> <int> ## 1 Germany 1995 83147770 ## 2 Germany 1996 83388930 ## 3 Germany 1997 83490697 ## 4 Germany 1998 83500716 ## 5 Germany 1999 83490881 ## 6 Germany 2000 83512459 .... ....
If we want to filter using multiple patterns, we can use str_detect() function in the following way. In this example using str_detect() function below, we filter rows if either “Ger” or “Aus” matches in country column. And this has given us rows corresponding the countries Australia and Germany.
population %>% filter(str_detect(country,"Ger|Aus")) ## # A tibble: 57 × 3 ## country year population ## <chr> <int> <int> ## 1 Australia 1995 18124234 ## 2 Australia 1996 18339037 ## 3 Australia 1997 18563442 ## 4 Australia 1998 18794552 ## 5 Australia 1999 19027438 ## 6 Australia 2000 19259377 ## 7 Australia 2001 19487257 ## 8 Australia 2002 19714625 ## 9 Australia 2003 19953121 ## 10 Australia 2004 20218481 ## # … with 47 more rows