In this tutorial, we will learn how to remove rows containing missing values using the drop_na() function from the tidyr package in R. drop_na(), available as part of the tidyverse, is a versatile function. First, we will see an example of removing all rows with at least one missing value using drop_na(). Then we will inspect a specific column and remove only the rows with missing values in that selected column.
First, let us load the tidyverse suite of R packages, which includes tidyr.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.2     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
Then, let us create a sample data frame with missing values in multiple columns using the tibble() function available in the tidyverse.
df <- tibble(col1 = letters[1:5],
             col2 = c(10, 20, NA, 40, 50),
             col3 = c(10, NA, 30, 40, NA),
             col4 = c(1:4, NA))
Our example data frame contains four missing values, denoted as NA, spread across three columns and three rows.
## # A tibble: 5 x 4
##   col1   col2  col3  col4
##   <chr> <dbl> <dbl> <int>
## 1 a        10    10     1
## 2 b        20    NA     2
## 3 c        NA    30     3
## 4 d        40    40     4
## 5 e        50    NA    NA
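Before dropping anything, it can help to confirm where the missing values are. The following is a minimal sketch using base R's is.na(), colSums() and complete.cases() on the same example data frame; the variable names total_na, na_per_column and rows_with_na are only illustrative and not part of the original post.

```r
library(tibble)

# Same example data frame as above
df <- tibble(col1 = letters[1:5],
             col2 = c(10, 20, NA, 40, 50),
             col3 = c(10, NA, 30, 40, NA),
             col4 = c(1:4, NA))

# Total number of missing values in the data frame
total_na <- sum(is.na(df))

# Number of missing values in each column
na_per_column <- colSums(is.na(df))

# Number of rows that contain at least one missing value
rows_with_na <- sum(!complete.cases(df))

total_na
na_per_column
rows_with_na
```

For this data frame, the counts should match the description above: four NAs in total, located in col2, col3 and col4, and three affected rows.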
We can use tidyr’s drop_na() function to drop all rows with missing values. The result is a data frame containing the two rows that have no missing values.
df %>% tidyr::drop_na()
## # A tibble: 2 x 4
##   col1   col2  col3  col4
##   <chr> <dbl> <dbl> <int>
## 1 a        10    10     1
## 2 d        40    40     4
In the above example, we used magrittr’s pipe operator %>% to feed the data frame to the drop_na() function. We can also provide the data frame as an argument to drop_na() to get the same result.
tidyr::drop_na(df)
## # A tibble: 2 x 4
##   col1   col2  col3  col4
##   <chr> <dbl> <dbl> <int>
## 1 a        10    10     1
## 2 d        40    40     4
If we have loaded the tidyr package, as we did above by loading the tidyverse, we can call drop_na() directly without the tidyr:: prefix.
df %>% drop_na()
## # A tibble: 2 x 4
##   col1   col2  col3  col4
##   <chr> <dbl> <dbl> <int>
## 1 a        10    10     1
## 2 d        40    40     4
Remove rows based on a column’s missing values using drop_na() in R
By default, drop_na() removes every row that has an NA in any column. Sometimes you might want to remove rows based on the missing values in a specific column only.
tidyr’s drop_na() can take one or more columns as input and drop the rows that have missing values in the specified columns. For example, here we remove rows based on the third column’s missing values. Note that the resulting data frame still has a missing value in its second row, in the second column (col2).
df %>% drop_na(col3)
## # A tibble: 3 x 4
##   col1   col2  col3  col4
##   <chr> <dbl> <dbl> <int>
## 1 a        10    10     1
## 2 c        NA    30     3
## 3 d        40    40     4
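Since drop_na() accepts more than one column, the same idea extends to several columns at once. Here is a small sketch, assuming we only care about missing values in col2 and col4; the choice of columns and the name res are just for illustration.

```r
library(tibble)
library(tidyr)

# Same example data frame as above
df <- tibble(col1 = letters[1:5],
             col2 = c(10, 20, NA, 40, 50),
             col3 = c(10, NA, 30, 40, NA),
             col4 = c(1:4, NA))

# Drop rows with a missing value in col2 or col4;
# col3 is not inspected, so its NAs do not cause a row to be dropped
res <- df %>% drop_na(col2, col4)
res
```

With this data, the rows for a, b and d should remain, and row b keeps its NA in col3 because that column was not among the ones we asked drop_na() to check.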
There is always more than one solution to a problem. We can also remove rows with missing values using the na.omit() function, available in the stats package that is part of base R.
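As a quick sketch of the base R route, the example below applies na.omit() to the same data frame. It also shows complete.cases(), another stats function that is not mentioned in this post, as an alternative way to keep only the complete rows. The names res_omit and res_cc are illustrative.

```r
library(tibble)

# Same example data frame as above
df <- tibble(col1 = letters[1:5],
             col2 = c(10, 20, NA, 40, 50),
             col3 = c(10, NA, 30, 40, NA),
             col4 = c(1:4, NA))

# na.omit() removes every row that has at least one NA
res_omit <- na.omit(df)

# complete.cases() returns a logical vector marking the rows without NAs,
# which we can use to subset the data frame
res_cc <- df[complete.cases(df), ]

res_omit
res_cc
```

Both approaches should keep the same two rows (a and d) that drop_na() returned when called without any column arguments.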
Check this post to learn how to use na.omit() to remove rows with missing values in a data frame or a matrix.