In this post, we will learn how to use dplyr’s rows_update() function with examples. rows_update() is useful for modifying specific rows in a data frame based on matching values in one or more key columns. It takes two dataframes, x and y, and updates existing rows in the target dataframe x with values from y wherever the keys match.
library(tidyverse)
packageVersion("dplyr")
[1] '1.1.4'
The basic syntax of dplyr’s rows_update() function is given below. It takes two dataframes as input, along with a few optional arguments.
rows_update(
  x,
  y,
  by = NULL,
  ...,
  unmatched = c("error", "ignore"),
  copy = FALSE,
  in_place = FALSE
)
Let us create two small dataframes. The first dataframe is our target dataframe, in which we want to update some rows.
# Original data frame
df1 <- tibble(
  student_id = 1:5,
  name = c("Alice", "Bob", "Charlie", "Liz", "Sam"),
  score = c(85, 90, 88, 92, 89)
)
df1
# A tibble: 5 × 3
  student_id name    score
       <int> <chr>   <dbl>
1          1 Alice      85
2          2 Bob        90
3          3 Charlie    88
4          4 Liz        92
5          5 Sam        89
The second dataframe contains the new values, together with a key column that identifies which rows we want to update.
# Data frame with updated scores for some student_ids
df2 <- tibble(
  student_id = c(1, 5),
  score = c(100, 98) # updated scores for Alice and Sam
)
df2
# A tibble: 2 × 2
  student_id score
       <dbl> <dbl>
1          1   100
2          5    98
In this example, we want to update the scores of two students. Originally, students 1 and 5 had scores of 85 and 89 (first dataframe). Now we want to update them to 100 and 98 (second dataframe).
df1 |>
  rows_update(df2)
Matching, by = "student_id"
# A tibble: 5 × 3
  student_id name    score
       <int> <chr>   <dbl>
1          1 Alice     100
2          2 Bob        90
3          3 Charlie    88
4          4 Liz        92
5          5 Sam        98
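To see exactly what the update touched, we can compare the result with the original dataframe. The snippet below is a minimal sketch that rebuilds the same df1 and df2 so it runs on its own; the names updated and changed_ids are introduced here only for illustration.

```r
library(dplyr)

# Same data as in the example above
df1 <- tibble(
  student_id = 1:5,
  name = c("Alice", "Bob", "Charlie", "Liz", "Sam"),
  score = c(85, 90, 88, 92, 89)
)
df2 <- tibble(
  student_id = c(1, 5),
  score = c(100, 98)
)

updated <- rows_update(df1, df2, by = "student_id")

# student_ids whose score differs from the original
changed_ids <- updated$student_id[updated$score != df1$score]
changed_ids

# Columns that are not present in df2 (here, name) are left as they were
identical(updated$name, df1$name)
```

Only the rows whose keys appear in df2 should differ, and the number and order of rows in df1 are preserved.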
When “by” is not supplied, dplyr’s rows_update() uses the first column of y as the key (here student_id) and prints a message saying so. We can also specify the key explicitly with the “by” argument, as shown below.
df1 |>
  rows_update(df2, by = "student_id")
# A tibble: 5 × 3
  student_id name    score
       <int> <chr>   <dbl>
1          1 Alice     100
2          2 Bob        90
3          3 Charlie    88
4          4 Liz        92
5          5 Sam        98
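The examples so far only use keys that already exist in df1. The unmatched argument from the signature controls what happens when y contains a key that x does not have. Here is a small sketch, assuming dplyr 1.1.0 or later; df3 is a hypothetical update table made up for this illustration, with a student_id of 6 that is not in df1.

```r
library(dplyr)

df1 <- tibble(
  student_id = 1:5,
  name = c("Alice", "Bob", "Charlie", "Liz", "Sam"),
  score = c(85, 90, 88, 92, 89)
)

# Hypothetical update table: student_id 6 does not exist in df1
df3 <- tibble(
  student_id = c(1, 6),
  score = c(100, 75)
)

# Default (unmatched = "error"): the call fails because key 6 is not in df1
strict_result <- try(rows_update(df1, df3, by = "student_id"), silent = TRUE)
inherits(strict_result, "try-error")

# unmatched = "ignore": the row for key 6 is skipped, key 1 is still updated
lenient_result <- rows_update(df1, df3, by = "student_id", unmatched = "ignore")
lenient_result
```

Note that rows_update() never adds rows to x, so with unmatched = "ignore" the result should still have five rows.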
Note that every column of the second dataframe y must also exist in x; y can contain all of x’s columns or just a subset of them. If y has a column that is not present in x, rows_update() will throw an error. Here is an example showing that.
# Data frame with updated scores for some student_ids
df2_new <- tibble(
  student_id = c(1, 5),
  score = c(100, 98),
  grade = c("A", "A") # extra column not present in the first
)
df2_new
# A tibble: 2 × 3
  student_id score grade
       <dbl> <dbl> <chr>
1          1   100 A
2          5    98 A
df1 |> rows_update(df2_new, by="student_id")
Error in `rows_update()`:
! All columns in `y` must exist in `x`.
ℹ The following columns only exist in `y`: `grade`.
Backtrace:
 1. dplyr::rows_update(df1, df2_new, by = "student_id")
 2. dplyr:::rows_update.data.frame(df1, df2_new, by = "student_id")
Error in rows_update(df1, df2_new, by = "student_id") :
ℹ The following columns only exist in `y`: `grade`.
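One possible way around this error is to drop the extra columns from y before updating. The sketch below uses select() with any_of() to keep only the columns of df2_new that also exist in df1. The name df2_trimmed is made up for this example, and this is just one option rather than a recommended fix.

```r
library(dplyr)

df1 <- tibble(
  student_id = 1:5,
  name = c("Alice", "Bob", "Charlie", "Liz", "Sam"),
  score = c(85, 90, 88, 92, 89)
)
df2_new <- tibble(
  student_id = c(1, 5),
  score = c(100, 98),
  grade = c("A", "A") # extra column not present in df1
)

# Keep only the columns of df2_new that also exist in df1
df2_trimmed <- df2_new |>
  select(any_of(names(df1)))
names(df2_trimmed)

# The update now works because every column of y exists in x
updated <- df1 |>
  rows_update(df2_trimmed, by = "student_id")
updated
```

If the extra column is actually needed in the result, it would have to be added to x first (for example with mutate()), since rows_update() only modifies columns that x already has.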