How to Replace NA values with Column Mean

One of the ways to deal with missing values (or NAs) is to replace them with mean values (under certain assumptions). The process is called imputation. In this tutorial, we will learn how to replace missing values (NAs) with column mean. First we will create some data with missing values and then show how to replace NAs with column mean values in a matrix. And then we will show to how to replace NAs with mean column values in a data frame.

Create a Matrix/Dataframe with NAs

Let us use sample() function to create some data with missing values – NAs.

set.seed(2020)
data <- sample(c(1:5,NA), 50, replace = TRUE)
data
##  [1]  4  4 NA  1  1  4  2 NA  1  5  2  2 NA  5  2  3  2  5  4  2 NA NA  4 NA  4
## [26]  2  4  5  4  4  3 NA  2  2 NA  3  5  4  5  5  2  5  1 NA  3  5  1  5  3  1

First, we will create a matrix using the data with missing values.

data_mat <- matrix(data, ncol=5)
head(data_mat)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    4    2   NA    3    2
## [2,]    4    2   NA   NA    5
## [3,]   NA   NA    4    2    1
## [4,]    1    5   NA    2   NA
## [5,]    1    2    4   NA    3
## [6,]    4    3    2    3    5

Then, we will create a dataframe using the data with missing values.

data_df<- as.data.frame(data_mat)

head(data_df)
##   V1 V2 V3 V4 V5
## 1  4  2 NA  3  2
## 2  4  2 NA NA  5
## 3 NA NA  4  2  1
## 4  1  5 NA  2 NA
## 5  1  2  4 NA  3
## 6  4  3  2  3  5

How to replace NAs with column means in a matrix in R?

Now let us see how to impute NAs with column mean in a matrix. We need means of all the columns in the matrix. We can use colMeans() function compute column means.

col_means <- colMeans(data_mat, na.rm=TRUE)
col_means
## [1] 2.750000 3.000000 3.857143 3.625000 2.888889

We will be use the column mean value sto substitute it for NAs in the data matrix. Let us find the locations of NAs in the matrix. We can use which() function with arr.ind=TRUE to get row and column positions of missing values.

na_ind <- which(is.na(data_mat), arr.ind=TRUE)
na_ind

##       row col
##  [1,]   3   1
##  [2,]   8   1
##  [3,]   3   2
##  [4,]   1   3
##  [5,]   2   3
##  [6,]   4   3
##  [7,]   2   4
##  [8,]   5   4
##  [9,]   4   5

Now we can simply replace column mean values at these NA locations.

data_mat[na_ind] <- col_means[na_ind[,2]]

And this is how our matrix after replacing NAs with column means look like.

data_mat

##       [,1] [,2]     [,3]  [,4]     [,5]
##  [1,] 4.00    2 3.857143 3.000 2.000000
##  [2,] 4.00    2 3.857143 3.625 5.000000
##  [3,] 2.75    3 4.000000 2.000 1.000000
##  [4,] 1.00    5 3.857143 2.000 2.888889
##  [5,] 1.00    2 4.000000 3.625 3.000000
##  [6,] 4.00    3 2.000000 3.000 5.000000
##  [7,] 2.00    2 4.000000 5.000 1.000000
##  [8,] 2.75    5 5.000000 4.000 5.000000
##  [9,] 1.00    4 4.000000 5.000 3.000000
## [10,] 5.00    2 4.000000 5.000 1.000000

How to replace NAs with column means in a dataframe in R?

We can follow exactly the same steps to replace missing values in a data frame with column mean values. For the sake of completeness, we compute column means using colMeans() function on the dataframe.

col_means <- colMeans(data_df, na.rm=TRUE)
col_means
##       V1       V2       V3       V4       V5 
## 2.750000 3.000000 3.857143 3.625000 2.888889

And then find the locations of missing values using which() function on the data frame.

na_ind <- which(is.na(data_df), arr.ind=TRUE)
na_ind
##       row col
##  [1,]   3   1
##  [2,]   8   1
##  [3,]   3   2
##  [4,]   1   3
##  [5,]   2   3
##  [6,]   4   3
##  [7,]   2   4
##  [8,]   5   4
##  [9,]   4   5

Then we substitute mean of column values at the NA locations.

data_df[na_ind] <- col_means[na_ind[,2]]

Our data frame after replacing NAs with column means look like this.

data_df
##      V1 V2       V3    V4       V5
## 1  4.00  2 3.857143 3.000 2.000000
## 2  4.00  2 3.857143 3.625 5.000000
## 3  2.75  3 4.000000 2.000 1.000000
## 4  1.00  5 3.857143 2.000 2.888889
## 5  1.00  2 4.000000 3.625 3.000000
## 6  4.00  3 2.000000 3.000 5.000000
## 7  2.00  2 4.000000 5.000 1.000000
## 8  2.75  5 5.000000 4.000 5.000000
## 9  1.00  4 4.000000 5.000 3.000000
## 10 5.00  2 4.000000 5.000 1.000000

Love tidyverse?, you might want to see this post on replacing NAs with column means using tidyverse approach.

Have not timed it, but for much larger dataframes, tidyverse approach might be slower than the base R approach.

Exit mobile version