dplyr contains(): select columns that contain a string

In this tutorial, we will learn how to select columns whose names contain a string using dplyr’s contains() function. contains() belongs to a family of helper functions for selecting columns, which also includes starts_with() and ends_with(). First we will see a simple example that uses a single string and selects all columns whose names contain it. Then we will learn how to use contains() with multiple strings to select columns whose names contain any of them.

To get started with some examples, let us load the tidyverse and palmerpenguins packages.

library(tidyverse)
library(palmerpenguins)
packageVersion("dplyr")

## [1] '1.0.9'

Our penguins data has 8 columns.

penguins %>% head(5)

## # A tibble: 5 × 8
##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Adelie  Torge…           39.1          18.7              181        3750 male 
## 2 Adelie  Torge…           39.5          17.4              186        3800 fema…
## 3 Adelie  Torge…           40.3          18                195        3250 fema…
## 4 Adelie  Torge…           NA            NA                 NA          NA <NA> 
## 5 Adelie  Torge…           36.7          19.3              193        3450 fema…
## # … with 1 more variable: year <int>
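Because the printed tibble truncates some column names, it can help to list the full names before matching against them. A minimal sketch, assuming palmerpenguins is installed; the variable names penguin_cols and n_cols are just illustrative:

```r
library(palmerpenguins)

# Full column names; the printed tibble above abbreviates some of them.
penguin_cols <- colnames(penguins)
print(penguin_cols)

# Number of columns in the data.
n_cols <- ncol(penguins)
print(n_cols)
```

contains() matches against these names, not against the values stored in the columns.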

dplyr’s contains() to select columns matching a string

To select columns whose names contain a string, we use the contains() function in combination with dplyr’s select() function. In this example, we select columns whose names contain the string “gth”, and we get two columns back.

penguins %>%
  select(contains("gth"))

## # A tibble: 344 × 2
##    bill_length_mm flipper_length_mm
##             <dbl>             <int>
##  1           39.1               181
##  2           39.5               186
##  3           40.3               195
##  4           NA                  NA
##  5           36.7               193
##  6           39.3               190
##  7           38.9               181
##  8           39.2               195
##  9           34.1               193
## 10           42                 190
## # … with 334 more rows
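One detail worth knowing is that contains() ignores case by default, which is controlled by its ignore.case argument. The sketch below, which assumes dplyr and palmerpenguins are installed and uses illustrative variable names, compares the default behaviour with a case-sensitive match:

```r
library(dplyr)
library(palmerpenguins)

# Default (ignore.case = TRUE): "GTH" still matches the *_length_mm columns.
insensitive <- penguins %>%
  select(contains("GTH"))
print(names(insensitive))

# Case-sensitive match: no column name contains upper-case "GTH".
sensitive <- penguins %>%
  select(contains("GTH", ignore.case = FALSE))
print(ncol(sensitive))
```

Setting ignore.case = FALSE is useful when column names differ only in capitalization.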

dplyr’s contains() to select columns with multiple matching strings

We can also use multiple strings and select columns whose names contain any of them. To do that, we provide the strings as a character vector to contains(), as shown below. In this example, we select columns whose names contain either “gth” or “pth”.

penguins %>%
  select(contains(c("gth", "pth")))

## # A tibble: 344 × 3
##    bill_length_mm flipper_length_mm bill_depth_mm
##             <dbl>             <int>         <dbl>
##  1           39.1               181          18.7
##  2           39.5               186          17.4
##  3           40.3               195          18  
##  4           NA                  NA          NA  
##  5           36.7               193          19.3
##  6           39.3               190          20.6
##  7           38.9               181          17.8
##  8           39.2               195          19.6
##  9           34.1               193          18.1
## 10           42                 190          20.2
## # … with 334 more rows
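A vector of strings behaves like an "or": a column is kept if its name contains any of the strings. As a rough sketch (assuming dplyr and palmerpenguins are installed; the variable names are illustrative), the same selection can be written with the | operator, and the & operator can be used when a name should contain both strings:

```r
library(dplyr)
library(palmerpenguins)

# Vector form: names containing "gth" OR "pth".
by_vector <- penguins %>%
  select(contains(c("gth", "pth")))

# The same set of columns written as an explicit union of two helpers.
by_union <- penguins %>%
  select(contains("gth") | contains("pth"))
print(setequal(names(by_vector), names(by_union)))

# Intersection: names containing both "bill" AND "mm".
by_both <- penguins %>%
  select(contains("bill") & contains("mm"))
print(names(by_both))
```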

dplyr’s contains() does not support regular expressions

Note that dplyr’s contains() only matches literal strings; it does not interpret regular expressions. For example, if we try to select all columns whose names contain “pth” or “gth” with the pattern “[pg]th”, we get a tibble with no columns.

penguins %>% 
  select(contains("[pg]th"))

## # A tibble: 344 × 0
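When a regular expression is really what we want, one option is the matches() selection helper, which treats its argument as a regular expression. A minimal sketch, assuming dplyr and palmerpenguins are installed (by_regex is just an illustrative name):

```r
library(dplyr)
library(palmerpenguins)

# matches() interprets "[pg]th" as a regular expression: "pth" or "gth".
by_regex <- penguins %>%
  select(matches("[pg]th"))
print(names(by_regex))
```

This picks up the same three columns as contains(c("gth", "pth")) in the earlier example.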
