How to Randomly Replace Values of Numerical Columns in a Dataframe with NAs

In this tutorial, we will learn how to randomly replace values in the numerical columns of a dataframe with NAs, or missing values. We will use dplyr’s across() function to select only the numerical columns of a dataframe and then replace each element with NA with a chosen probability.

First, let us load the tidyverse packages and the Palmer Penguins dataset.

library(tidyverse)
library(palmerpenguins)

Let us start with a dataframe without any missing values by dropping any rows with missing data.

df <- penguins %>%
  drop_na()

df
## # A tibble: 333 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           36.7          19.3               193        3450
##  5 Adelie  Torgersen           39.3          20.6               190        3650
##  6 Adelie  Torgersen           38.9          17.8               181        3625
##  7 Adelie  Torgersen           39.2          19.6               195        4675
##  8 Adelie  Torgersen           41.1          17.6               182        3200
##  9 Adelie  Torgersen           38.6          21.2               191        3800
## 10 Adelie  Torgersen           34.6          21.1               198        4400
## # … with 323 more rows, and 2 more variables: sex <fct>, year <int>

The approach we will take to replace random values of numerical columns with NAs is as follows. We will use dplyr’s across() function to apply the replacement to each numerical column in turn.

With the sample() function, we create a logical vector of the same length as the column, drawing TRUE with probability 0.9 and FALSE with probability 0.1. We then use the ifelse() function to keep the elements where the vector is TRUE and replace the elements where it is FALSE with NA.
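To see the masking step in isolation, here is a minimal sketch on a single vector. The vector x, the names keep and x_na, and the seed are made up for illustration; the seed only makes the random draw repeatable.

```r
# Minimal sketch of the masking step on one vector (values are illustrative)
set.seed(42)  # assumption: any fixed seed works, it only makes the draw repeatable

x <- c(39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 41.1, 38.6, 34.6)

# Draw TRUE (keep) with probability 0.9 and FALSE (replace) with probability 0.1
keep <- sample(c(TRUE, FALSE),
               size = length(x),
               replace = TRUE,
               prob = c(0.9, 0.1))

# Keep the value where `keep` is TRUE, otherwise put NA
x_na <- ifelse(keep, x, NA)

x_na
```

Because each element is drawn independently, the number of NAs is not fixed: it varies from run to run around 10% of the vector length. The same step, applied to every numerical column of df with across(), looks like this: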

df_na <- df %>% 
  mutate(across(where(is.numeric),
                ~ifelse(sample(c(TRUE, FALSE),
                               size = length(.),
                               replace = TRUE, 
                               prob = c(0.9, 0.1)),
                        ., NA)))

Here is what our result looks like.

df_na

## # A tibble: 333 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186          NA
##  3 Adelie  Torgersen           40.3          NA                  NA        3250
##  4 Adelie  Torgersen           36.7          19.3               193        3450
##  5 Adelie  Torgersen           39.3          20.6               190        3650
##  6 Adelie  Torgersen           38.9          17.8               181        3625
##  7 Adelie  Torgersen           39.2          19.6               195        4675
##  8 Adelie  Torgersen           41.1          17.6               182        3200
##  9 Adelie  Torgersen           38.6          21.2               191        3800
## 10 Adelie  Torgersen           NA            21.1               198        4400
## # … with 323 more rows, and 2 more variables: sex <fct>, year <int>

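We replaced roughly 10% of the elements in each numerical column with NAs. Let us verify that by counting the number of NAs in each column. Again we use dplyr’s across() function, this time inside summarize(), to count the NAs in all the numerical columns. Since there are 333 elements in each column, we expect about 33 NAs per column; because each element is drawn independently, the actual counts vary around that number. Note that year is also a numerical column, so it received NAs as well.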

df_na %>%
  summarize(across(where(is.numeric),
                   function(x) sum(is.na(x))))

## # A tibble: 1 × 5
##   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
##            <int>         <int>             <int>       <int> <int>
## 1             41            34                35          35    26
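If you prefer to see the share of missing values directly instead of the counts, one option is to take the mean of is.na(). The sketch below uses a small made-up tibble (toy, with hypothetical columns id, x and y) so that the result is easy to check by hand; the same summarize() call can be applied to df_na.

```r
library(dplyr)

# Hypothetical toy data, only for illustration
toy <- tibble(
  id = c("a", "b", "c", "d"),
  x  = c(1.5, NA, 3.2, 4.8),
  y  = c(10L, 20L, NA, NA)
)

# The mean of a logical vector is the share of TRUE values,
# so this gives the fraction of NAs in each numerical column
na_share <- toy %>%
  summarize(across(where(is.numeric), ~ mean(is.na(.x))))

na_share
# x has 1 NA out of 4 (0.25), y has 2 NAs out of 4 (0.5)
```

For df_na, dividing the counts above by 333 gives the same information, for example 41 / 333 is about 12% and 26 / 333 is about 8%.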
