dplyr rows_update(): Modify existing rows

In this post, we will learn how to use dplyr’s rows_update() function with examples. dplyr’s rows_update() function is a useful function to modify or update specific rows in a data frame based on a matching values in key column. It takes in two dataframes x and y, updates existing rows in the target data frame x with values from the y data frame where the keys (columns) match.

library(tidyverse)
packageVersion("dplyr")

[1] '1.1.4'

Basic syntax of dplyr’s rows_update() function looks as given below. It takes in two dataframes as input with optional arguments.

rows_update(
  x,
  y,
  by = NULL,
  ...,
  unmatched = c("error", "ignore"),
  copy = FALSE,
  in_place = FALSE
)

Let us create two small dataframes.The first dataframe is our target dataframe in which we want to update some rows.

# Original data frame
df1 <- tibble(
  student_id = 1:5,
  name = c("Alice", "Bob", "Charlie", "Liz", "Sam"),
  score = c(85, 90, 88, 92, 89)
)

df1

# A tibble: 5 × 3
  student_id name    score
       <int> <chr>   <dbl>
1          1 Alice      85
2          2 Bob        90
3          3 Charlie    88
4          4 Liz        92
5          5 Sam        89

And the second dataframe contains the new values that needed to be updated with and a key column specifying the variable that we want to update.

# Data frame with updated scores for some student_ids
df2 <- tibble(
  student_id = c(1, 5),
  score = c(100, 98)  # updated scores for Alice and Sam
)

df2

# A tibble: 2 × 2
  student_id score
       <dbl> <dbl>
1          1   100
2          5    98

In the example we are considering here we are interested in updates the scores of two student ids. Originally the students 1 & 5 had 85 & 89 as their scores (first dataframe). Now we want to update them to 100 & 98 (second dataframe).

df1 |>
  rows_update(df2)

Matching, by = "student_id"
# A tibble: 5 × 3
  student_id name    score
       <int> <chr>   <dbl>
1          1 Alice     100
2          2 Bob        90
3          3 Charlie    88
4          4 Liz        92
5          5 Sam        98

dplyr’s row_update() identifies the column name to merge automatically, but we can also specify using “by” argument as shown below.

df1 |>
  rows_update(df2, by="student_id")

# A tibble: 5 × 3
  student_id name    score
       <int> <chr>   <dbl>
1          1 Alice     100
2          2 Bob        90
3          3 Charlie    88
4          4 Liz        92
5          5 Sam        98

]

Note that the second dataframe y must have the same columns of x or a subset. If there is a column that is not present in the first dataframe, rows_update() will throw an error. Here is an example showing that.

# Data frame with updated scores for some student_ids
df2_new <- tibble(
  student_id = c(1, 5),
  score = c(100, 98),
  grade = c("A", "A") # extra column not present in the first 
)

df2

# A tibble: 2 × 2
  student_id score
       <dbl> <dbl>
1          1   100
2          5    98

df1 |>
  rows_update(df2_new, by="student_id")

Error in `rows_update()`:
! All columns in `y` must exist in `x`.
ℹ The following columns only exist in `y`: `grade`.
Backtrace:
 1. dplyr::rows_update(df1, df2_new, by = "student_id")
 2. dplyr:::rows_update.data.frame(df1, df2_new, by = "student_id")
Error in rows_update(df1, df2_new, by = "student_id") : 
ℹ The following columns only exist in `y`: `grade`.

Related