In this tutorial, we will learn how to compute the sum of a column by groups in another column in a dataframe. Basic, idea is to group the dataframe by the grouping variable/column and then find the sum for each group.
Let us get started by loading tidyverse suite of R packages. We will be using dplyr’s group_by() function and summarize() functions to find the sum/total of a variable by group.
library(tidyverse)
We will create a simple dataframe with two columns, where one is a grouping variable and the other is numerical variable. We use tibble() to create the dataframe from scratch, mainly using sample() function to creating the two variables.
set.seed(41) df <- tibble(grp = sample(c("g1","g2"), 8, replace=TRUE), counts = sample(1:20,8) )
Our dataframe looks like this. Our grouping variable “grp” has two unique values and we will be computing the sum for both the values using group_by() and summarize() functions from dplyr.
df ## # A tibble: 8 × 2 ## grp counts ## <chr> <int> ## 1 g2 6 ## 2 g1 18 ## 3 g1 2 ## 4 g2 13 ## 5 g2 17 ## 6 g2 19 ## 7 g2 5 ## 8 g1 20
Computing sum of column in a dataframe based on a grouping column in R
dplyr’s group_by() function allows use to split the dataframe into smaller dataframes based on a variable of interest. The result after group_by() has all the elements of original dataframe, but with grouping information.
df %>% group_by(grp) ## # A tibble: 8 × 2 ## # Groups: grp [2] ## grp counts ## <chr> <int> ## 1 g2 6 ## 2 g1 18 ## 3 g1 2 ## 4 g2 13 ## 5 g2 17 ## 6 g2 19 ## 7 g2 5 ## 8 g1 20
Then, we can use summarize() function to compute the sum of each grouping variable.
df %>% group_by(grp) %>% summarize(total = sum(counts)) ## # A tibble: 2 × 2 ## grp total ## <chr> <int> ## 1 g1 40 ## 2 g2 60