• Skip to primary navigation
  • Skip to main content
  • Skip to primary sidebar

Rstats 101

Learn R Programming Tips & Tricks for Statistics and Data Science

  • Home
  • About
    • Privacy Policy
  • Show Search
Hide Search

How to Replace NA values with Column Mean

rstats101 · January 14, 2022 ·

One of the ways to deal with missing values (or NAs) is to replace them with mean values (under certain assumptions). The process is called imputation. In this tutorial, we will learn how to replace missing values (NAs) with column mean. First we will create some data with missing values and then show how to replace NAs with column mean values in a matrix. And then we will show to how to replace NAs with mean column values in a data frame.

Create a Matrix/Dataframe with NAs

Let us use sample() function to create some data with missing values – NAs.

set.seed(2020)
data <- sample(c(1:5,NA), 50, replace = TRUE)
data
##  [1]  4  4 NA  1  1  4  2 NA  1  5  2  2 NA  5  2  3  2  5  4  2 NA NA  4 NA  4
## [26]  2  4  5  4  4  3 NA  2  2 NA  3  5  4  5  5  2  5  1 NA  3  5  1  5  3  1

First, we will create a matrix using the data with missing values.

data_mat <- matrix(data, ncol=5)
head(data_mat)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    4    2   NA    3    2
## [2,]    4    2   NA   NA    5
## [3,]   NA   NA    4    2    1
## [4,]    1    5   NA    2   NA
## [5,]    1    2    4   NA    3
## [6,]    4    3    2    3    5

Then, we will create a dataframe using the data with missing values.

data_df<- as.data.frame(data_mat)

head(data_df)
##   V1 V2 V3 V4 V5
## 1  4  2 NA  3  2
## 2  4  2 NA NA  5
## 3 NA NA  4  2  1
## 4  1  5 NA  2 NA
## 5  1  2  4 NA  3
## 6  4  3  2  3  5

How to replace NAs with column means in a matrix in R?

Now let us see how to impute NAs with column mean in a matrix. We need means of all the columns in the matrix. We can use colMeans() function compute column means.

col_means <- colMeans(data_mat, na.rm=TRUE)
col_means
## [1] 2.750000 3.000000 3.857143 3.625000 2.888889

We will be use the column mean value sto substitute it for NAs in the data matrix. Let us find the locations of NAs in the matrix. We can use which() function with arr.ind=TRUE to get row and column positions of missing values.

na_ind <- which(is.na(data_mat), arr.ind=TRUE)
na_ind

##       row col
##  [1,]   3   1
##  [2,]   8   1
##  [3,]   3   2
##  [4,]   1   3
##  [5,]   2   3
##  [6,]   4   3
##  [7,]   2   4
##  [8,]   5   4
##  [9,]   4   5

Now we can simply replace column mean values at these NA locations.

data_mat[na_ind] <- col_means[na_ind[,2]]

And this is how our matrix after replacing NAs with column means look like.

data_mat

##       [,1] [,2]     [,3]  [,4]     [,5]
##  [1,] 4.00    2 3.857143 3.000 2.000000
##  [2,] 4.00    2 3.857143 3.625 5.000000
##  [3,] 2.75    3 4.000000 2.000 1.000000
##  [4,] 1.00    5 3.857143 2.000 2.888889
##  [5,] 1.00    2 4.000000 3.625 3.000000
##  [6,] 4.00    3 2.000000 3.000 5.000000
##  [7,] 2.00    2 4.000000 5.000 1.000000
##  [8,] 2.75    5 5.000000 4.000 5.000000
##  [9,] 1.00    4 4.000000 5.000 3.000000
## [10,] 5.00    2 4.000000 5.000 1.000000

How to replace NAs with column means in a dataframe in R?

We can follow exactly the same steps to replace missing values in a data frame with column mean values. For the sake of completeness, we compute column means using colMeans() function on the dataframe.

col_means <- colMeans(data_df, na.rm=TRUE)
col_means
##       V1       V2       V3       V4       V5 
## 2.750000 3.000000 3.857143 3.625000 2.888889

And then find the locations of missing values using which() function on the data frame.

na_ind <- which(is.na(data_df), arr.ind=TRUE)
na_ind
##       row col
##  [1,]   3   1
##  [2,]   8   1
##  [3,]   3   2
##  [4,]   1   3
##  [5,]   2   3
##  [6,]   4   3
##  [7,]   2   4
##  [8,]   5   4
##  [9,]   4   5

Then we substitute mean of column values at the NA locations.

data_df[na_ind] <- col_means[na_ind[,2]]

Our data frame after replacing NAs with column means look like this.

data_df
##      V1 V2       V3    V4       V5
## 1  4.00  2 3.857143 3.000 2.000000
## 2  4.00  2 3.857143 3.625 5.000000
## 3  2.75  3 4.000000 2.000 1.000000
## 4  1.00  5 3.857143 2.000 2.888889
## 5  1.00  2 4.000000 3.625 3.000000
## 6  4.00  3 2.000000 3.000 5.000000
## 7  2.00  2 4.000000 5.000 1.000000
## 8  2.75  5 5.000000 4.000 5.000000
## 9  1.00  4 4.000000 5.000 3.000000
## 10 5.00  2 4.000000 5.000 1.000000

Love tidyverse?, you might want to see this post on replacing NAs with column means using tidyverse approach.

  • How to Replace NAs with Column mean using tidyverse

Have not timed it, but for much larger dataframes, tidyverse approach might be slower than the base R approach.

Related

Filed Under: impute with mean values, replace NA Tagged With: Replace NAs with column mean

Reader Interactions

Trackbacks

  1. How to Replace NAs with Column mean using tidyverse - Rstats101.com says:
    January 20, 2022 at 10:43 am

    […] to do with them. How to Replace NAs with column means using tidyverse In earlier post, we learned how to replace NAs with column means using base R functions. Here we use tidyverse functions to do the […]

Primary Sidebar

Recent Posts

  • How to create a nested dataframe with lists
  • How to compute proportion with tidyverse
  • How to Compute Z-Score of Multiple Columns
  • How to drop unused level of factor variable in R
  • How to compute Z-score

Categories

%in% arrange() as.data.frame as_tibble built-in data R colSums() R cor() in R data.frame dplyr dplyr across() dplyr group_by() dplyr rename() dplyr rowwise() dplyr row_number() dplyr select() dplyr slice_max() dplyr slice_sample() drop_na R duplicated() gsub head() impute with mean values is.element() linear regression matrix() function na.omit R NAs in R near() R openxlsx pivot_longer() prod() R.version replace NA replace NAs tidyverse R Function rstats rstats101 R version scale() sessionInfo() t.test() tidyr tidyselect tidyverse write.xlsx

Copyright © 2025 · Daily Dish Pro on Genesis Framework · WordPress · Log in

Go to mobile version