Rstats 101

Learn R Programming Tips & Tricks for Statistics and Data Science

How to Randomly Replace Values of Numerical Columns in a Dataframe with NAs

rstats101 · August 16, 2022

In this tutorial, we will learn how to randomly replace values in the numerical columns of a dataframe with NAs (missing values). We will use dplyr’s across() function to select only the numerical columns, and then replace each element with NA with a fixed probability, so that roughly a chosen percentage of the values in each column becomes missing.

First, let us load the tidyverse packages and the Palmer penguins dataset.

library(tidyverse)
library(palmerpenguins)

Let us start with a dataframe without any missing values by dropping any rows with missing data.

df <- penguins %>%
  drop_na()

df
## # A tibble: 333 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           36.7          19.3               193        3450
##  5 Adelie  Torgersen           39.3          20.6               190        3650
##  6 Adelie  Torgersen           38.9          17.8               181        3625
##  7 Adelie  Torgersen           39.2          19.6               195        4675
##  8 Adelie  Torgersen           41.1          17.6               182        3200
##  9 Adelie  Torgersen           38.6          21.2               191        3800
## 10 Adelie  Torgersen           34.6          21.1               198        4400
## # … with 323 more rows, and 2 more variables: sex <fct>, year <int>
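
Before adding missing values, it can be worth confirming that none are left after drop_na(). Here is a minimal sketch on a made-up data frame; the names toy_raw and toy_complete and all the values are assumptions for illustration only. The same two checks can be run on df.

```r
library(tidyr)

# Made-up data: rows 2 and 3 each contain one missing value
toy_raw <- data.frame(
  species        = c("Adelie", "Gentoo", "Chinstrap", "Adelie"),
  bill_length_mm = c(39.1, NA, 46.5, 38.6),
  body_mass_g    = c(3750L, 5000L, NA, 3800L)
)

# Drop every row that contains at least one NA
toy_complete <- drop_na(toy_raw)

# Is there any NA left anywhere in the data frame?
anyNA(toy_complete)

# Number of NAs in each column
colSums(is.na(toy_complete))
```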

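The approach we will take to replace random values of numerical columns with NAs is as follows. We will use dplyr’s across() function to apply the replacement to each numerical column separately.

With the sample() function, we create a logical vector as long as the column, in which each element is TRUE with probability 0.9 and FALSE with probability 0.1. We then use the ifelse() function to keep the original value where the vector is TRUE and to insert NA where it is FALSE. Because sample() is random, the result changes on every run unless a seed is set with set.seed() beforehand.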

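df_na <- df %>% 
  # apply the same replacement to every numeric column
  mutate(across(where(is.numeric),
                # TRUE (keep the value) with probability 0.9, FALSE (set to NA) with 0.1
                ~ifelse(sample(c(TRUE, FALSE),
                               size = length(.x),
                               replace = TRUE, 
                               prob = c(0.9, 0.1)),
                        .x, NA)))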
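
To see what the anonymous function does to a single column, here is a minimal sketch on a made-up vector. The vector x, the names keep and x_na, and the seed value are illustrative assumptions and not part of the example above; set.seed() is there only so that repeated runs produce the same mask.

```r
# Hypothetical toy vector standing in for one numeric column
x <- c(39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 41.1, 38.6, 34.6)

# Fix the random number generator so the mask is reproducible (arbitrary seed)
set.seed(101)

# Logical mask: each element is independently TRUE with probability 0.9
keep <- sample(c(TRUE, FALSE),
               size = length(x),
               replace = TRUE,
               prob = c(0.9, 0.1))

# Keep the value where the mask is TRUE, otherwise insert NA
x_na <- ifelse(keep, x, NA)

# The number of NAs equals the number of FALSE entries in the mask
sum(is.na(x_na)) == sum(!keep)
```

Because each element is drawn independently, the number of NAs varies from run to run and from column to column; 10% is the expected share, not a guarantee.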

Here is what our result looks like.

df_na

## # A tibble: 333 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186          NA
##  3 Adelie  Torgersen           40.3          NA                  NA        3250
##  4 Adelie  Torgersen           36.7          19.3               193        3450
##  5 Adelie  Torgersen           39.3          20.6               190        3650
##  6 Adelie  Torgersen           38.9          17.8               181        3625
##  7 Adelie  Torgersen           39.2          19.6               195        4675
##  8 Adelie  Torgersen           41.1          17.6               182        3200
##  9 Adelie  Torgersen           38.6          21.2               191        3800
## 10 Adelie  Torgersen           NA            21.1               198        4400
## # … with 323 more rows, and 2 more variables: sex <fct>, year <int>

We replaced roughly 10% of the elements in each column with NAs. Let us verify that by counting the number of NAs in each column. Again we use dplyr’s across() function, this time inside summarize(), to count the NAs in all the numerical columns. Since we know that there are 333 elements in each column, we expect about 33 NAs per column, and the observed counts, which range from 26 to 41, are in line with that.

df_na %>%
  summarize(across(where(is.numeric),
                   function(x) sum(is.na(x))))

## # A tibble: 1 × 5
##   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
##            <int>         <int>             <int>       <int> <int>
## 1             41            34                35          35    26
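
The counts differ between columns because every element is drawn independently. If a fixed number of NAs per column is needed instead, one alternative is to sample positions without replacement. This is a different technique from the one used in this post. In the sketch below, the helper name replace_exact_na(), its prop argument, the toy data frame, and the seed are all hypothetical and used only for illustration.

```r
library(dplyr)

# Hypothetical helper: set exactly round(prop * length(x)) elements of x to NA
replace_exact_na <- function(x, prop = 0.1) {
  n_na <- round(prop * length(x))
  # sample.int() draws n_na distinct positions (no replacement by default)
  idx <- sample.int(length(x), size = n_na)
  x[idx] <- NA
  x
}

# Made-up data frame with one character column and two numeric columns
toy <- tibble(
  id     = letters[1:20],
  height = seq(150, 188, by = 2),
  weight = seq(50, 88, by = 2)
)

set.seed(2022)  # arbitrary seed, only for reproducibility

toy_na <- toy %>%
  mutate(across(where(is.numeric), ~ replace_exact_na(.x, prop = 0.1)))

# Share of NAs per numeric column: 2 of 20 values in each column
toy_na %>%
  summarize(across(where(is.numeric), ~ mean(is.na(.x))))
```

Using mean(is.na(.x)) instead of sum(is.na(.x)) reports the proportion of missing values directly, which is easier to compare against the target than a raw count.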

Filed Under: dplyr across(), NAs in R Tagged With: randomly replace values to NAs