In this tutorial, we will learn how to replace missing values NAs in a data set with column means using tidyverse functions. Dealing with missing data is very important in doing data analysis. Understanding the origin of missing values NAs is very important to make a decision on what to do with them.
In earlier post, we learned how to replace NAs with column means using base R functions. Here we use tidyverse functions to do the same.
First, let us load tidyverse suit of R packages.
library(tidyverse)
Create a dataframe with missing values()
Then, let us create a simple dataframe, a tibble, with some missing values.
set.seed(49) data_df <- tibble(a = sample(c(1:3,NA), 5, replace = TRUE), b = sample(c(1:3,NA), 5, replace = TRUE), c = sample(c(1:3,NA), 5, replace = TRUE))
Here is how our toy dataframe with NAs look like.
data_df ## # A tibble: 5 × 3 ## a b c ## <int> <int> <int> ## 1 3 3 NA ## 2 1 NA 3 ## 3 3 2 1 ## 4 1 3 NA ## 5 3 3 1
Let us add unique row ID using row_number() function in dplyr package part of tidyverse.
data_df <- data_df %>% mutate(row_id = row_number())
data_df ## # A tibble: 5 × 4 ## a b c row_id ## <int> <int> <int> <int> ## 1 3 3 NA 1 ## 2 1 NA 3 2 ## 3 3 2 1 3 ## 4 1 3 NA 4 ## 5 3 3 1 5
Reshape wide dataframe to long dataframe using pivot_longer() with column names as one of the columns
The basic idea to replace NAs with column mean using tidyverse is that we will group_by() each column and replace the NAs in each column with the column mean. To do that we first need to reshape our data frame in wide form to long form using pivot_longer() function in tidyr.
The tidy dataframe has three columns, one for row ID, one for column names and the other for actual values in the dataframe.
data_df %>% pivot_longer(-row_id, names_to="column_id", values_to="val") ## # A tibble: 15 × 3 ## row_id column_id val ## <int> <chr> <int> ## 1 1 a 3 ## 2 1 b 3 ## 3 1 c NA ## 4 2 a 1 ## 5 2 b NA ## 6 2 c 3 ## 7 3 a 3 ## 8 3 b 2 ## 9 3 c 1 ## 10 4 a 1 ## 11 4 b 3 ## 12 4 c NA ## 13 5 a 3 ## 14 5 b 3 ## 15 5 c 1
Use group_by() function to replace NAs in each column
Now we can group_by() column name ID and change the values using mutate() function. Here we check each value in a column and replace it with a column mean if it is NA. We use ifelse() function to check if each element is NA or not.
data_df %>% pivot_longer(-row_id, names_to="column_id", values_to="val") %>% group_by(column_id) %>% mutate(val= ifelse(is.na(val), mean(val, na.rm = T), val)) ## # A tibble: 15 × 3 ## # Groups: column_id [3] ## row_id column_id val ## <int> <chr> <dbl> ## 1 1 a 3 ## 2 1 b 3 ## 3 1 c 1.67 ## 4 2 a 1 ## 5 2 b 2.75 ## 6 2 c 3 ## 7 3 a 3 ## 8 3 b 2 ## 9 3 c 1 ## 10 4 a 1 ## 11 4 b 3 ## 12 4 c 1.67 ## 13 5 a 3 ## 14 5 b 3 ## 15 5 c 1
Reshape long dataframe to original wide dataframe using pivot_wider()
Now that we have replaced all the missing values with column means, we can reshape the dataframe into wider for that we started with using pivot_wider() function.
data_df %>% pivot_longer(-row_id, names_to="column_id", values_to="val") %>% group_by(column_id) %>% mutate(val= ifelse(is.na(val), mean(val,na.rm = T), val)) %>% pivot_wider(names_from="column_id", values_from = "val") ## # A tibble: 5 × 4 ## row_id a b c ## <int> <dbl> <dbl> <dbl> ## 1 1 3 3 1.67 ## 2 2 1 2.75 3 ## 3 3 3 2 1 ## 4 4 1 3 1.67 ## 5 5 3 3 1
If needed we can removed the unique row ID as well.
data_df %>% pivot_longer(-row_id, names_to="column_id", values_to="val") %>% group_by(column_id) %>% mutate(val= ifelse(is.na(val), mean(val,na.rm = T), val)) %>% pivot_wider(names_from="column_id", values_from = "val") %>% select(-row_id) ## # A tibble: 5 × 3 ## a b c ## <dbl> <dbl> <dbl> ## 1 3 3 1.67 ## 2 1 2.75 3 ## 3 3 2 1 ## 4 1 3 1.67 ## 5 3 3 1
1 comment
Comments are closed.