In this tutorial, we will learn how to replace all NA values in a dataframe with a specific value, such as zero, in R.
Create a dataframe with NA values
Let us start by creating a dataframe with missing values, i.e. NAs, in its columns. We first build a vector with the sample() function, drawing 50 values with replacement from the numbers 1 to 5 plus NA.
set.seed(2020)
data <- sample(c(1:5,NA), 50, replace = TRUE)
Our data looks like this.
data
##  [1]  4  4 NA  1  1  4  2 NA  1  5  2  2 NA  5  2  3  2  5  4  2 NA NA  4 NA  4
## [26]  2  4  5  4  4  3 NA  2  2 NA  3  5  4  5  5  2  5  1 NA  3  5  1  5  3  1
Let us convert our data vector into a matrix using the matrix() function. Here we specify a matrix with 5 columns.
data_mat <- matrix(data, ncol=5)
Our matrix with missing values looks like this.
head(data_mat)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    4    2   NA    3    2
## [2,]    4    2   NA   NA    5
## [3,]   NA   NA    4    2    1
## [4,]    1    5   NA    2   NA
## [5,]    1    2    4   NA    3
## [6,]    4    3    2    3    5
We then convert the matrix into a dataframe using the as.data.frame() function.
data_df <- as.data.frame(data_mat)
head(data_df)
##   V1 V2 V3 V4 V5
## 1  4  2 NA  3  2
## 2  4  2 NA NA  5
## 3 NA NA  4  2  1
## 4  1  5 NA  2 NA
## 5  1  2  4 NA  3
## 6  4  3  2  3  5
Find the locations of NA values in R using the is.na() function
To replace NAs with zeros, we first need to know where the NAs are. The is.na() function tests each element of the dataframe and reports whether it is NA.
is.na(data_df)
##          V1    V2    V3    V4    V5
##  [1,] FALSE FALSE  TRUE FALSE FALSE
##  [2,] FALSE FALSE  TRUE  TRUE FALSE
##  [3,]  TRUE  TRUE FALSE FALSE FALSE
##  [4,] FALSE FALSE  TRUE FALSE  TRUE
##  [5,] FALSE FALSE FALSE  TRUE FALSE
##  [6,] FALSE FALSE FALSE FALSE FALSE
##  [7,] FALSE FALSE FALSE FALSE FALSE
##  [8,]  TRUE FALSE FALSE FALSE FALSE
##  [9,] FALSE FALSE FALSE FALSE FALSE
## [10,] FALSE FALSE FALSE FALSE FALSE
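Because is.na() returns a logical matrix, base R helpers can summarise it. The sketch below uses a small made-up dataframe (toy_df and its values are assumptions for illustration, not the tutorial's data). It counts the NAs in each column with colSums() and lists their row and column positions with which() and arr.ind = TRUE.

```r
# Small illustrative dataframe (hypothetical values, not the tutorial's data)
toy_df <- data.frame(V1 = c(4, NA, 1),
                     V2 = c(NA, NA, 3),
                     V3 = c(2, 5, 6))

# Logical matrix: TRUE wherever a value is missing
na_mask <- is.na(toy_df)

# Number of NAs in each column
na_per_column <- colSums(na_mask)

# Row and column position of every NA, one row per missing value
na_positions <- which(na_mask, arr.ind = TRUE)

print(na_per_column)
print(na_positions)
```

A quick count like this can help decide whether replacing every NA with a single value is reasonable for a given column.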
Replace all NA values with zeros in R
The is.na() function returns a logical matrix with the same shape as the dataframe, and we can use it as an index to replace the NAs with zeros.
data_df[is.na(data_df)] <- 0
Now our dataframe no longer has any NAs; they have all been replaced with zeros.
data_df
##    V1 V2 V3 V4 V5
## 1   4  2  0  3  2
## 2   4  2  0  0  5
## 3   0  0  4  2  1
## 4   1  5  0  2  0
## 5   1  2  4  0  3
## 6   4  3  2  3  5
## 7   2  2  4  5  1
## 8   0  5  5  4  5
## 9   1  4  4  5  3
## 10  5  2  4  5  1
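Printing the dataframe works for ten rows, but a programmatic check scales better. As a minimal sketch on a made-up dataframe (toy_df is an assumption, not the tutorial's data), we can count the NAs before the replacement and confirm that none remain afterwards.

```r
# Small illustrative dataframe (hypothetical values)
toy_df <- data.frame(V1 = c(4, NA, 1),
                     V2 = c(NA, 2, 3))

# Count missing values before the replacement
na_count_before <- sum(is.na(toy_df))

# Replace every NA with zero, as in the tutorial
toy_df[is.na(toy_df)] <- 0

# Check whether any NA is left after the replacement
has_na_after <- any(is.na(toy_df))

print(na_count_before)
print(has_na_after)
```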
Replace all NA values with some specific numerical value
The same approach works for any replacement value. In this example, we recreate the dataframe from the matrix, since the previous step overwrote its NAs, and replace all NAs with 1000.
data_df <- as.data.frame(data_mat)
data_df[is.na(data_df)] <- 1000
data_df
##      V1   V2   V3   V4   V5
## 1     4    2 1000    3    2
## 2     4    2 1000 1000    5
## 3  1000 1000    4    2    1
## 4     1    5 1000    2 1000
## 5     1    2    4 1000    3
## 6     4    3    2    3    5
## 7     2    2    4    5    1
## 8  1000    5    5    4    5
## 9     1    4    4    5    3
## 10    5    2    4    5    1
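Since only the replacement value changes between the two examples, the idiom can be wrapped in a small function. The sketch below is one possible way to do it. The name fill_na_with and the dataframe toy_df are hypothetical, chosen for illustration. The function returns a modified copy, so the original dataframe keeps its NAs and does not need to be rebuilt.

```r
# Hypothetical helper: return a copy of df with every NA replaced by `value`
fill_na_with <- function(df, value = 0) {
  df[is.na(df)] <- value
  df
}

# Small illustrative dataframe (hypothetical values)
toy_df <- data.frame(V1 = c(4, NA, 1),
                     V2 = c(NA, 2, 3))

filled_zero <- fill_na_with(toy_df)        # NAs become 0
filled_1000 <- fill_na_with(toy_df, 1000)  # NAs become 1000

print(filled_zero)
print(filled_1000)
```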
P.S. NAs are often a useful way of marking missing values. Before replacing all NAs with zeros or some other value, make sure that it is the right thing to do. Imputing missing values is an active area of research in statistics.
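To see why the choice matters, here is a small sketch with a made-up vector (x and its values are assumptions for illustration). The mean computed while ignoring the NA differs from the mean computed after replacing the NA with zero.

```r
# Illustrative vector with one missing value (hypothetical data)
x <- c(1, 2, NA, 3)

# Mean of the observed values only: (1 + 2 + 3) / 3
mean_ignoring_na <- mean(x, na.rm = TRUE)

# Replace the NA with zero, then take the mean: (1 + 2 + 0 + 3) / 4
x_zero <- x
x_zero[is.na(x_zero)] <- 0
mean_with_zeros <- mean(x_zero)

print(mean_ignoring_na)  # 2
print(mean_with_zeros)   # 1.5
```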