How to select only numeric columns in a dataframe

In this tutorial, we will learn how to select only the numeric columns from a dataframe whose columns have different data types. We will use dplyr’s select() function together with the where() and is.numeric() functions to select the numeric columns.

How to select all numerical columns from a dataframe

Let us first load tidyverse and the palmerpenguins package, whose penguins dataset we will use to illustrate selecting numerical variables.

library(tidyverse)
library(palmerpenguins)

The select() function we will be using comes from the dplyr package, so let us check the installed dplyr version.

packageVersion("dplyr")
## [1] '1.0.8'

Printing the penguins data shows that it contains factor, double, and integer variables.

penguins

## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # … with 334 more rows, and 2 more variables: sex <fct>, year <int>
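
For a more compact view of the column types than the printed tibble, one option is to ask each column for its class. This is only a sketch using base R's sapply() and class(), which the tutorial itself does not use; col_classes is a name chosen here for illustration.

```r
library(palmerpenguins)

# class() of each column, returned as a named character vector.
# Note: class() reports double columns as "numeric", where the tibble shows <dbl>.
col_classes <- sapply(penguins, class)
col_classes
```

The factor columns (species, island, sex) are the ones we expect to be dropped when selecting numeric columns.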

Select All Numerical Columns, without using the names

We can select all the numerical columns from the dataframe by their data type, without specifying their names. The is.numeric() function tells us whether a variable is numerical; it treats both double and integer variables as numeric.
The where() function is a selection helper that selects the variables for which a function returns TRUE. In our example, is.numeric() returns TRUE for numerical variables and FALSE for the others.

penguins %>%
  select(where(is.numeric))
## # A tibble: 344 × 5
##    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
##             <dbl>         <dbl>             <int>       <int> <int>
##  1           39.1          18.7               181        3750  2007
##  2           39.5          17.4               186        3800  2007
##  3           40.3          18                 195        3250  2007
##  4           NA            NA                  NA          NA  2007
##  5           36.7          19.3               193        3450  2007
##  6           39.3          20.6               190        3650  2007
##  7           38.9          17.8               181        3625  2007
##  8           39.2          19.6               195        4675  2007
##  9           34.1          18.1               193        3475  2007
## 10           42            20.2               190        4250  2007
## # … with 334 more rows
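
To see why exactly these five columns were kept, we can apply is.numeric() to every column ourselves. The sketch below uses base R's sapply() for this; it is an illustration of what the predicate returns per column, not a description of how where() is implemented internally, and numeric_flags is a name chosen here.

```r
library(palmerpenguins)

# Apply is.numeric() to each column; the result is a named logical vector
numeric_flags <- sapply(penguins, is.numeric)
numeric_flags

# Names of the columns for which the predicate returned TRUE
numeric_cols <- names(penguins)[numeric_flags]
numeric_cols
```

The names in numeric_cols match the columns in the select(where(is.numeric)) output above: the two double columns and the three integer columns.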

Get this warning? “Predicate functions must be wrapped in `where()`”.

It is easy to forget the where() function when trying to select numerical columns, like this.

penguins %>%
  select(is.numeric)

However, this gives us the following warning, shown once per session, advising us to use where(); for now, it still returns the same results.

## Warning: Predicate functions must be wrapped in `where()`.
## 
##   # Bad
##   data %>% select(is.numeric)
## 
##   # Good
##   data %>% select(where(is.numeric))
## 
## ℹ Please update your code.
## This message is displayed once per session.
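
Following the warning's advice, the corrected call wraps the predicate in where(). As a small extension not covered in the text above, the sketch also negates the helper with ! to keep the non-numeric columns instead; this assumes a tidyselect version that supports ! on selection helpers (1.0.0 or later, to the best of our knowledge). The object names are chosen here for illustration.

```r
library(dplyr)
library(palmerpenguins)

# Wrap the predicate in where() so that select() no longer warns
numeric_penguins <- penguins %>%
  select(where(is.numeric))

# The complement: keep every column for which is.numeric() is FALSE
non_numeric_penguins <- penguins %>%
  select(!where(is.numeric))

names(numeric_penguins)
names(non_numeric_penguins)
```

Together, the two results cover all eight columns of penguins, with the factor columns species, island, and sex ending up in the second one.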