In this tutorial, we will learn how to randomly replace values in the numerical columns of a dataframe with NAs (missing values). We will use dplyr’s across() function to select only the numerical columns and then change each element to NA with a fixed probability, so that roughly a chosen percentage of each column ends up missing.
First, let us load the tidyverse packages and the Palmer Penguins dataset.
library(tidyverse)
library(palmerpenguins)
Let us start with a dataframe without any missing values by dropping any rows with missing data.
df <- penguins %>%
  drop_na()
df
## # A tibble: 333 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           36.7          19.3               193        3450
##  5 Adelie  Torgersen           39.3          20.6               190        3650
##  6 Adelie  Torgersen           38.9          17.8               181        3625
##  7 Adelie  Torgersen           39.2          19.6               195        4675
##  8 Adelie  Torgersen           41.1          17.6               182        3200
##  9 Adelie  Torgersen           38.6          21.2               191        3800
## 10 Adelie  Torgersen           34.6          21.1               198        4400
## # … with 323 more rows, and 2 more variables: sex <fct>, year <int>
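Before introducing any NAs, it can be worth confirming that the starting dataframe really has none. The following is a minimal sketch of such a check, assuming the tidyr and palmerpenguins packages are installed; it uses base R's anyNA() and is.na(), which are not part of the original walkthrough.

```r
library(tidyr)
library(palmerpenguins)

# Same starting point as above: keep only complete rows.
df <- drop_na(penguins)

# anyNA() is FALSE when no missing values remain anywhere in df.
anyNA(df)

# Per-column NA counts; every entry should be zero after drop_na().
colSums(is.na(df))
```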
The approach we will take to replace random values of numerical columns with NAs is as follows. We will use dplyr’s across() function to apply the replacement column-wise to every numerical column.
With the sample() function, we create a logical vector the same length as each column, where each element is TRUE with probability 0.9 and FALSE with probability 0.1. We then use ifelse() to keep the elements where the vector is TRUE and replace those where it is FALSE with NA.
df_na <- df %>%
  mutate(across(where(is.numeric),
                ~ifelse(sample(c(TRUE, FALSE),
                               size = length(.),
                               replace = TRUE,
                               prob = c(0.9, 0.1)),
                        ., NA)))
Here is what our result looks like.
df_na
## # A tibble: 333 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186          NA
##  3 Adelie  Torgersen           40.3          NA                  NA        3250
##  4 Adelie  Torgersen           36.7          19.3               193        3450
##  5 Adelie  Torgersen           39.3          20.6               190        3650
##  6 Adelie  Torgersen           38.9          17.8               181        3625
##  7 Adelie  Torgersen           39.2          19.6               195        4675
##  8 Adelie  Torgersen           41.1          17.6               182        3200
##  9 Adelie  Torgersen           38.6          21.2               191        3800
## 10 Adelie  Torgersen           NA            21.1               198        4400
## # … with 323 more rows, and 2 more variables: sex <fct>, year <int>
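To see what the sample() and ifelse() step does in isolation, here is a minimal sketch on a short plain vector, using base R only. The vector x, the names keep and x_na, and the seed are illustrative choices, not part of the original code.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# A small example vector standing in for one numerical column.
x <- c(10.5, 20.1, 30.7, 40.2, 50.9, 60.3, 70.8, 80.4, 90.6, 100.0)

# One logical value per element: TRUE (keep) with probability 0.9,
# FALSE (replace with NA) with probability 0.1.
keep <- sample(c(TRUE, FALSE), size = length(x),
               replace = TRUE, prob = c(0.9, 0.1))

# ifelse() returns x where keep is TRUE and NA where it is FALSE.
x_na <- ifelse(keep, x, NA)

keep
x_na
```

Because every element is drawn independently, the number of NAs is itself random; in a vector this short it can easily be zero.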
Each element had a 10% chance of being replaced, so roughly 10% of each column should now be NA. Let us verify that by counting the number of NAs in each column. Again we use dplyr’s across() function, this time inside summarize(), to count the NAs in all the numerical columns. Since we know that there are 333 elements in each column, we expect about 33 NAs per column; the counts below vary around that number because each element is drawn independently.
df_na %>%
  summarize(across(where(is.numeric), function(x){sum(is.na(x))}))
## # A tibble: 1 × 5
##   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
##            <int>         <int>             <int>       <int> <int>
## 1             41            34                35          35    26
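The counts differ from column to column because the probabilistic approach only fixes the chance of replacement, not the number of replacements. If an exact fraction of NAs per column is wanted instead, one possible variation is to sample positions rather than TRUE/FALSE flags. The sketch below is an alternative to the method above, not part of it; the helper na_exact(), the name df_na_exact, and the seed are hypothetical names chosen for illustration.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

# Hypothetical helper: set exactly round(prop * length(x)) randomly
# chosen elements of x to NA and leave the rest unchanged.
na_exact <- function(x, prop = 0.1) {
  n_na <- round(prop * length(x))
  x[sample.int(length(x), n_na)] <- NA
  x
}

set.seed(42)  # arbitrary seed, only so the result is reproducible

df_na_exact <- penguins %>%
  drop_na() %>%
  mutate(across(where(is.numeric), na_exact))

# Proportion of NAs per numerical column: the mean() of a logical
# vector is the fraction of TRUE values.
df_na_exact %>%
  summarize(across(where(is.numeric), ~ mean(is.na(.x))))
```

With 333 rows and prop = 0.1, every numerical column gets round(33.3) = 33 NAs, so the reported proportion is 33/333 in each column. The positions are still drawn independently for each column.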