In this tutorial, we will learn how to drop columns with values that are all NAs. We will use two approaches to remove columns with all NAs. First, we will use tidyverse approach, where we perform column-wise operation to see all values are NAs and select columns that are not all NAs. Next we will use base R approach by counting the number NAs per columns using apply() function and select columns that are not all NAs.
First let us load tidyverse meta package.
library(tidyverse)
Create a dataframe with a column of all NAs
To create a dataframe with missing values we use a vector with more missing values than non-missing values.
x <- c(1:5,rep(NA,10)) x [1] 1 2 3 4 5 NA NA NA NA NA NA NA NA NA NA
And create a toy dataframe with 3 columns and 5 rows with one of the columns is all NAs.
set.seed(2022) df <- tibble(C1= sample(x,5), C2= sample(x,5), C3= sample(x,5)) df # A tibble: 5 × 3 C1 C2 C3 <int> <int> <int> 1 4 NA NA 2 3 NA NA 3 NA NA NA 4 NA NA 2 5 NA NA NA
Removing columns with all NAs with tidyverse
Using tidyverse approach we remove one or more columns with all NAs using select() function. Here instead of selecting columns by names, we select columns that are all NAs. We use an anonymous function to find if a column is all NAs.
df %>% select(where(function(x) any(!is.na(x)))) # A tibble: 5 × 2 C1 C3 <int> <int> 1 4 NA 2 3 NA 3 NA NA 4 NA 2 5 NA NA
In the above example we have one column with all NAs. Here is the second example where we remove multiple columns with all NAs
set.seed(2202) x <- c(1:3,rep(NA,10)) df2 <- tibble(C1= sample(x,5), C2= sample(x,5), C3= sample(x,5), C4= sample(x,5), C5= sample(x,5)) df2 # A tibble: 5 × 5 C1 C2 C3 C4 C5 <int> <int> <int> <int> <int> 1 3 NA NA NA NA 2 NA NA NA NA 3 3 NA NA NA NA NA 4 NA NA NA NA 1 5 NA NA 1 NA NA
Our dataframe has two columns with all NAs.
df2 %>% select(where(function(x) any(!is.na(x)))) # A tibble: 5 × 3 C1 C3 C5 <int> <int> <int> 1 3 NA NA 2 NA NA 3 3 NA NA NA 4 NA NA 1 5 NA 1 NA
Removing columns with all NAs use base R
To remove columns with all NAs using base R approach, we first compute the number of missing values per column using apply() function.
n_NAs <- apply(df, 2, function(x){sum(is.na(x))}) n_NAs C1 C2 C3 3 5 4
Then we select columns with fewer NAs by checking if the number of NAs is smaller than the number of rows.
df[ ,n_NAs < nrow(df)] # A tibble: 5 × 2 C1 C3 <int> <int> 1 4 NA 2 3 NA 3 NA NA 4 NA 2 5 NA NA
As before now we see an example of using base R approach to remove multiple columns with all NAs. In this example, we use the dataframe with two columns of all NAs and remove them both using base R approach.
n_NAs <- apply(df2, 2, function(x){sum(is.na(x))}) n_NAs C1 C2 C3 C4 C5 4 5 4 5 3
df2[ ,n_NAs < nrow(df2)] # A tibble: 5 × 3 C1 C3 C5 <int> <int> <int> 1 3 NA NA 2 NA NA 3 3 NA NA NA 4 NA NA 1 5 NA 1 NA