dplyr case_when() to create new variable using multiple conditions

In this tutorial, we will learn how to use dplyr’s case_when() function to create a new variable based on multiple conditions. case_when() offers a general solution for situations where you would otherwise need several nested if_else() calls.
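To make the comparison with if_else() concrete, here is a minimal sketch that grades the same values both ways. The vector name scores and the three-threshold scale are assumptions chosen only for illustration, and the .default argument assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Hypothetical score vector used only for this comparison
scores <- c(10, 30, 50, 70, 90)

# Nested if_else(): each extra category adds another level of nesting
nested <- if_else(scores < 35, "F",
            if_else(scores < 50, "D",
              if_else(scores < 70, "C", "B")))

# case_when(): one condition per line, checked from top to bottom
flat <- case_when(
  scores < 35 ~ "F",
  scores < 50 ~ "D",
  scores < 70 ~ "C",
  .default = "B"
)

# Both approaches produce the same character vector
identical(nested, flat)
```

The two results are identical, but the case_when() version stays flat no matter how many categories are added.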

In this post, we’ll explore the case_when() function through several examples. Let us start by creating a simple data frame with one numerical column, and then create a new variable from that column using multiple conditions.

First, let us load tidyverse and check the installed version of dplyr. In this example, we will use dplyr version 1.1.0. As of v1.1.0, case_when() has gained the .default argument, and we will be using it in this post.

library(tidyverse)
packageVersion("dplyr")

[1] '1.1.0'

Our toy data frame with one numerical variable, exam scores, looks like this.

df <- tibble(score = seq(10, 100, by = 20))
df

# A tibble: 5 × 1
  score
  <dbl>
1    10
2    30
3    50
4    70
5    90

In the first example, we will convert exam scores into letter grades from A to F. For example, a score below 35 gets grade F, and a score of at least 35 but below 50 gets grade D.

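The basic syntax of case_when() is shown in the example below. Each argument is a two-sided formula of the form condition ~ value. The conditions are checked in order, and each row gets the value of the first condition it satisfies.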

df %>%
  mutate(grade = case_when(
    score < 35 ~ "F",
    score < 50 ~ "D",
    score < 70 ~ "C",
    score <= 80 ~ "B",
    score <= 100 ~ "A"  # <= so that a perfect score of 100 is also graded
  ))

This will give us a new dataframe with an additional column as shown below.

# A tibble: 5 × 2
  score grade
  <dbl> <chr>
1    10 F    
2    30 F    
3    50 C    
4    70 B    
5    90 A
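Because case_when() uses the first condition that matches, the order of the conditions matters. The following sketch, which uses the hypothetical column names ordered and reversed, applies the same three thresholds in two different orders. It also shows that a row matching no condition gets NA when no default is supplied.

```r
library(dplyr)

df <- tibble(score = seq(10, 100, by = 20))

order_demo <- df %>%
  mutate(
    # Narrowest threshold first: each score lands in the intended band
    ordered = case_when(
      score < 35 ~ "F",
      score < 50 ~ "D",
      score < 70 ~ "C"
    ),
    # Broadest threshold first: it matches every score below 70,
    # so the two later conditions are never reached
    reversed = case_when(
      score < 70 ~ "C",
      score < 50 ~ "D",
      score < 35 ~ "F"
    )
  )

order_demo
```

In the reversed column the scores 10, 30 and 50 are all graded C. In both columns the scores 70 and 90 come out as NA because none of the conditions covers them.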

dplyr case_when() example with default value

Sometimes you may want to cover a few specific cases and assign a default value to everything else. Here is a simple example using case_when() with a default value. In the example below, we use the .default argument to set the value for any row that matches none of the conditions (here, scores above 80 get grade A by default).

df %>%
  mutate(grade = case_when(
    score < 35 ~ "F",   # same F threshold as in the first example
    score < 50 ~ "D",
    score < 70 ~ "C",
    score <= 80 ~ "B",
    .default = "A"
  ))

# A tibble: 5 × 2
  score grade
  <dbl> <chr>
1    10 F    
2    30 F    
3    50 C    
4    70 B    
5    90 A

The above .default example is a bit contrived, as only one case is left for the default to cover. Here is another simple example with the .default argument: any score greater than 35 gets the value specified by .default.

df %>%
  mutate(grade = case_when(
    score <= 35 ~ "Fail",
    .default = "Pass"
  ))

# A tibble: 5 × 2
  score grade
  <dbl> <chr>
1    10 Fail 
2    30 Fail 
3    50 Pass 
4    70 Pass 
5    90 Pass
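If you come across code written for dplyr versions before 1.1.0, you will often see the fallback written as a final TRUE ~ value condition instead of .default. The sketch below, with the hypothetical column names grade_default and grade_true, puts the two spellings side by side. Both work in dplyr 1.1.0.

```r
library(dplyr)

df <- tibble(score = seq(10, 100, by = 20))

pass_fail <- df %>%
  mutate(
    # dplyr >= 1.1.0: explicit fallback via .default
    grade_default = case_when(score <= 35 ~ "Fail", .default = "Pass"),
    # Older idiom: a final condition that is always TRUE
    grade_true = case_when(score <= 35 ~ "Fail", TRUE ~ "Pass")
  )

# The two columns hold the same values
identical(pass_fail$grade_default, pass_fail$grade_true)
```

The .default form states the intent more directly, which is why this post uses it.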

dplyr case_when() example with missing values

In the example below, we will learn how to deal with missing values, i.e. NAs, in the variable of interest. First, we will create a data frame with a column that contains an NA.

df <- tibble(score = c(seq(10, 100, by = 20), NA))
df

# A tibble: 6 × 1
  score
  <dbl>
1    10
2    30
3    50
4    70
5    90
6    NA

If we ignore NAs when using case_when() to create a new variable, we might inadvertently make a mistake. In the example below, the row with an NA score matches no condition, falls through to .default, and is therefore assigned the grade Pass.


df %>%
  mutate(grade = case_when(
    score <= 35 ~ "Fail",
    .default = "Pass"
  ))

# A tibble: 6 × 2
  score grade
  <dbl> <chr>
1    10 Fail 
2    30 Fail 
3    50 Pass 
4    70 Pass 
5    90 Pass 
6    NA Pass
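Why the NA row ends up as Pass can be checked on a small vector. This is a minimal sketch. The names score, cmp and res are assumptions used only here.

```r
library(dplyr)

# Hypothetical three-element vector with one missing score
score <- c(30, 50, NA)

# Comparing NA with a number yields NA, not TRUE or FALSE
cmp <- score <= 35
cmp

# case_when() treats an NA condition as "no match",
# so the NA element falls through to .default
res <- case_when(score <= 35 ~ "Fail", .default = "Pass")
res
```

The comparison returns TRUE, FALSE and NA, and case_when() returns "Fail", "Pass" and "Pass".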

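We can handle NA values by adding a condition that checks for them with the is.na() function and assigns them a specific value. Here we keep them as NA. With dplyr 1.1.0 a plain NA works on the right-hand side, whereas earlier versions required a typed missing value such as NA_character_.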

df %>%
  mutate(grade = case_when(
    score <= 35 ~ "Fail",
    is.na(score) ~ NA,
    .default = "Pass"
  ))

# A tibble: 6 × 2
  score grade
  <dbl> <chr>
1    10 Fail 
2    30 Fail 
3    50 Pass 
4    70 Pass 
5    90 Pass 
6    NA <NA>
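If you would rather flag missing scores than keep them as NA, the same is.na() condition can return a label instead. In the sketch below, the label "Missing" and the name labelled are assumptions chosen for illustration. count() is used only to tally the resulting categories.

```r
library(dplyr)

df <- tibble(score = c(seq(10, 100, by = 20), NA))

labelled <- df %>%
  mutate(grade = case_when(
    is.na(score) ~ "Missing",  # hypothetical label for absent scores
    score <= 35 ~ "Fail",
    .default = "Pass"
  ))

# Tally the grades to confirm the NA row got its own category
count(labelled, grade)
```

The tally shows two Fail rows, one Missing row and three Pass rows, so the missing score is no longer counted as a pass.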