In this tutorial, we will learn about the colSums() function in base R and use it to calculate the sum of every column in a matrix or a dataframe. We will work through two examples. First, we will calculate the column sums of a matrix and a dataframe with no missing values (NAs). Next, we will learn how to compute the column sums when the matrix or dataframe has missing values.
Create a matrix and dataframe from scratch
Let us create a matrix and a dataframe from scratch using random numbers generated with the sample() function. First, we create a vector of 50 numbers drawn with replacement from 1 to 6.
set.seed(42)
data <- sample(c(1:6), 50, replace = TRUE)
data
##  [1] 1 5 1 1 2 4 2 2 1 4 1 5 6 4 2 2 3 1 1 3 4 5 5 5 4 2 4 3 2 1 2 6 3 6 2 4 4 6
## [39] 2 5 4 5 4 2 2 3 1 5 2 2
Then we use the matrix() function to create a matrix with five columns.
data_mat <- matrix(data, ncol=5)
Finally, we use the as.data.frame() function to convert the matrix into a dataframe.
data_df <- as.data.frame(data_mat)
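To see how a vector is laid out when it becomes a matrix and then a dataframe, here is a minimal sketch on a tiny hand-written vector. The names v, m and df are made up for this illustration and are not part of the tutorial's data. By default matrix() fills the values column by column, and as.data.frame() gives the columns the default names V1, V2, and so on.

```r
# Tiny illustrative vector (hypothetical, not the tutorial's random sample)
v <- 1:6

# matrix() fills column by column by default
m <- matrix(v, ncol = 3)
dim(m)     # number of rows, then number of columns
m[, 1]     # the first column holds the first two values of v

# Converting to a dataframe assigns default column names
df <- as.data.frame(m)
names(df)
```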
Sum of columns of a matrix
Let us compute the sum of all the columns using colSums() on the matrix. Our data matrix is complete with no missing data.
head(data_mat)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    1    4    2    4
## [2,]    5    5    5    6    5
## [3,]    1    6    5    3    4
## [4,]    1    4    5    6    2
## [5,]    2    2    4    2    2
## [6,]    4    2    2    4    3
Applying colSums() to the matrix, we get the sum of each column as a vector.
colSums(data_mat)
## [1] 23 28 35 40 30
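One way to convince ourselves of what colSums() computes is to compare it with summing each column separately using apply(). The sketch below uses a small hand-written matrix (the names m, sums_fast and sums_apply are made up for this illustration) so the result does not depend on random numbers.

```r
# Small illustrative matrix: columns are (1, 2), (3, 4), (5, 6)
m <- matrix(1:6, ncol = 3)

sums_fast  <- colSums(m)         # column sums in one call
sums_apply <- apply(m, 2, sum)   # sum() applied to each column in turn

# colSums() returns doubles while sum() on integers returns integers,
# so compare the values with all.equal() rather than identical()
all.equal(sums_fast, sums_apply)
```

Both approaches give the same values; colSums() is simply the more direct way to write it.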
Sum of columns of a dataframe
We can also use the colSums() function to calculate the sum of every column in a dataframe. All columns must be numeric; colSums() stops with an error if the dataframe contains a non-numeric column, such as a character column.
head(data_df)
##   V1 V2 V3 V4 V5
## 1  1  1  4  2  4
## 2  5  5  5  6  5
## 3  1  6  5  3  4
## 4  1  4  5  6  2
## 5  2  2  4  2  2
## 6  4  2  2  4  3
In our sample dataframe, all the columns are numeric, so we get the sum of every column, labelled with the column names.
colSums(data_df)
## V1 V2 V3 V4 V5
## 23 28 35 40 30
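If a dataframe does contain a non-numeric column, one possible workaround is to keep only the numeric columns before calling colSums(). The sketch below assumes a small made-up dataframe (df, with columns id, x and y, none of which come from the tutorial's data) and uses sapply() with is.numeric() to pick the columns.

```r
# Hypothetical dataframe with one character column
df <- data.frame(
  id = c("a", "b", "c"),
  x  = c(1, 2, 3),
  y  = c(10, 20, 30)
)

# colSums(df) would stop with an error because 'id' is not numeric,
# so select the numeric columns first
numeric_cols <- sapply(df, is.numeric)

# drop = FALSE keeps a dataframe even if only one column is selected
num_sums <- colSums(df[, numeric_cols, drop = FALSE])
num_sums
```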
How to calculate the sum of columns of a matrix with missing data (NAs)
First, let us create a matrix and a dataframe with missing values.
data <- sample(c(1:5, NA), 50, replace = TRUE)
data_mat <- matrix(data, ncol=5)
data_df <- as.data.frame(data_mat)
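In this example, the data matrix has missing values (NAs) in every column except the second. The first six rows shown below contain NAs only in the first and fourth columns; the NAs in the third and fifth columns fall in later rows.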
head(data_mat)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]   NA    2    4   NA    4
## [2,]   NA    5    1    2    2
## [3,]    2    1    3    2    2
## [4,]    4    1    3    1    3
## [5,]    3    4    5    2    5
## [6,]   NA    5    5    5    5
So when we apply colSums() to the data matrix, it computes the sum only for columns with no missing values. For columns containing missing values we get NA. This is because the colSums() function has the argument na.rm=FALSE by default.
colSums(data_mat)
## [1] NA 30 NA NA NA
With the argument na.rm=TRUE, colSums() ignores the missing values and sums the remaining values in each column.
colSums(data_mat, na.rm=TRUE)
## [1] 18 30 34 22 28
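Before dropping missing values, it can help to know how many each column has. Since is.na() returns a logical matrix and TRUE counts as 1 when summed, colSums() can do this count as well. The sketch below uses a small hand-written matrix (m_na, na_counts and col_totals are names made up for this illustration) rather than the random data above.

```r
# Small illustrative matrix, filled column by column:
# column 1 = (1, NA, 3), column 2 = (4, 5, 6), column 3 = (NA, NA, 9)
m_na <- matrix(c(1, NA, 3,
                 4, 5, 6,
                 NA, NA, 9), ncol = 3)

# Number of missing values in each column
na_counts <- colSums(is.na(m_na))
na_counts

# Column sums with the missing values ignored
col_totals <- colSums(m_na, na.rm = TRUE)
col_totals
```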
How to calculate the sum of columns of a dataframe with missing data (NAs)
head(data_df)
##   V1 V2 V3 V4 V5
## 1 NA  2  4 NA  4
## 2 NA  5  1  2  2
## 3  2  1  3  2  2
## 4  4  1  3  1  3
## 5  3  4  5  2  5
## 6 NA  5  5  5  5
When there are missing values, colSums() by default returns NA for the affected columns of a dataframe as well.
colSums(data_df)
## V1 V2 V3 V4 V5
## NA 30 NA NA NA
We can use the na.rm=TRUE argument to compute the sums of columns that contain missing values. The result is the sum of each column with its missing values ignored.
colSums(data_df, na.rm=TRUE)
## V1 V2 V3 V4 V5
## 18 30 34 22 28
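One thing to keep in mind with na.rm=TRUE is that a column made up entirely of NAs sums to 0, which can look like a real total. The sketch below, on a made-up dataframe (df_na, raw_totals, n_obs and fixed_totals are hypothetical names), shows one possible way to put NA back for such columns. Whether this is the right choice depends on the analysis.

```r
# Hypothetical dataframe: column 'b' has no observed values at all
df_na <- data.frame(
  a = c(1, 2, NA),
  b = c(NA_real_, NA_real_, NA_real_)
)

# With na.rm = TRUE the all-NA column sums to 0
raw_totals <- colSums(df_na, na.rm = TRUE)
raw_totals

# Count the non-missing values per column
n_obs <- colSums(!is.na(df_na))

# Report NA instead of 0 where a column had nothing to sum
fixed_totals <- raw_totals
fixed_totals[n_obs == 0] <- NA
fixed_totals
```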