In this tutorial, we will learn how to use dplyr’s case_when() function to create a new variable based on multiple conditions. case_when() offers a general solution for situations where you would otherwise need multiple nested if_else() calls.
In this blog post, we’ll explore the case_when() function with multiple examples. Let us start by creating a simple data frame with one numerical column. We will then create a new variable from that numerical variable using multiple conditions.
First, let us load tidyverse and check our installed version of dplyr. In this example we will use dplyr version 1.1.0. With v1.1.0, case_when() gained the .default argument, and we will be using it in this post.
library(tidyverse)
packageVersion("dplyr")
[1] '1.1.0'
Our data frame with one numerical variable looks like this.
df <- tibble(score = seq(10, 100, by = 20))
df
# A tibble: 5 × 1
  score
  <dbl>
1    10
2    30
3    50
4    70
5    90
In the first example, we will convert exam scores into grades ranging from A to F. For example, when a score is less than 35 we assign grade F, and when a score is at least 35 but less than 50 we assign grade D. The basic syntax of case_when() is shown in the example below.
df %>%
  mutate(grade = case_when(
    score < 35 ~ "F",
    score < 50 ~ "D",
    score < 70 ~ "C",
    score <= 80 ~ "B",
    score < 100 ~ "A"
  ))
# A tibble: 5 × 2
  score grade
  <dbl> <chr>
1    10 F
2    30 F
3    50 C
4    70 B
5    90 A
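Two details of this syntax are worth keeping in mind: case_when() checks the conditions in order and uses the first one that is TRUE, and a row that matches none of the conditions gets NA. The sketch below illustrates both points. The data frame name df_edge and the boundary scores 34, 35 and 100 are assumptions chosen only for illustration.

```r
library(dplyr)

# Hypothetical boundary scores, chosen only for illustration
df_edge <- tibble(score = c(34, 35, 100))

res_edge <- df_edge %>%
  mutate(grade = case_when(
    score < 35 ~ "F",    # 34 matches here, later conditions are not used
    score < 50 ~ "D",    # 35 fails the first test and matches this one
    score < 70 ~ "C",
    score <= 80 ~ "B",
    score < 100 ~ "A"    # 100 matches nothing, so its grade is NA
  ))

res_edge
```

If a perfect score should also earn an A, the last condition could be written as score <= 100 instead.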
dplyr case_when() example with default value
Here is a simple example using the case_when() function with a default value. In the example below, we use .default to set the value for any row that matches none of the specified conditions.
df %>%
  mutate(grade = case_when(
    score <= 35 ~ "F",
    score < 50 ~ "D",
    score < 70 ~ "C",
    score <= 80 ~ "B",
    .default = "A"
  ))
# A tibble: 5 × 2
  score grade
  <dbl> <chr>
1    10 F
2    30 F
3    50 C
4    70 B
5    90 A
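For readers on a dplyr version older than 1.1.0, where .default is not available, the same fallback is conventionally written as a final TRUE ~ condition. Here is a minimal sketch, assuming the same five scores as above. The names df_scores and res_true are hypothetical.

```r
library(dplyr)

# Same five scores as in the example above (df_scores is a hypothetical name)
df_scores <- tibble(score = seq(10, 100, by = 20))

res_true <- df_scores %>%
  mutate(grade = case_when(
    score <= 35 ~ "F",
    score < 50 ~ "D",
    score < 70 ~ "C",
    score <= 80 ~ "B",
    TRUE ~ "A"   # always TRUE, so it catches every row not matched above
  ))

res_true
```

Because TRUE matches every remaining row, it has to be the last condition. Anything placed after it would never be reached.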
The above .default example is a bit contrived, as only one case is left for the default to cover. Here is another simple example with the .default argument, in which any score greater than 35 gets the value specified by .default.
df %>%
  mutate(grade = case_when(
    score <= 35 ~ "Fail",
    .default = "Pass"
  ))
# A tibble: 5 × 2
  score grade
  <dbl> <chr>
1    10 Fail
2    30 Fail
3    50 Pass
4    70 Pass
5    90 Pass
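With only two outcomes, the same column can also be produced by a single if_else() call, which is the function that case_when() generalizes. This is a minimal sketch for comparison, assuming the same five scores. The names df_scores and res_ifelse are hypothetical.

```r
library(dplyr)

# Same five scores as in the example above (df_scores is a hypothetical name)
df_scores <- tibble(score = seq(10, 100, by = 20))

# if_else(condition, value_if_true, value_if_false)
res_ifelse <- df_scores %>%
  mutate(grade = if_else(score <= 35, "Fail", "Pass"))

res_ifelse
```

Once a third outcome is needed, if_else() calls have to be nested, and that is the point where case_when() tends to be easier to read.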
dplyr’s case_when() example with missing values
In the example below, we will learn how to deal with missing values, i.e. NAs, present in the variable of interest. First, we will create a data frame with a column containing an NA.
df <- tibble(score = c(seq(10, 100, by = 20), NA))
df
# A tibble: 6 × 1
  score
  <dbl>
1    10
2    30
3    50
4    70
5    90
6    NA
If we ignore NAs when using case_when() to create a new variable, we might inadvertently make a mistake. In the example below, the row with NA for score is assigned the grade Pass: the comparison NA <= 35 evaluates to NA rather than TRUE, so the row falls through to .default.
df %>%
  mutate(grade = case_when(
    score <= 35 ~ "Fail",
    .default = "Pass"
  ))
# A tibble: 6 × 2
  score grade
  <dbl> <chr>
1    10 Fail
2    30 Fail
3    50 Pass
4    70 Pass
5    90 Pass
6    NA Pass
We can handle NA values by adding a condition that checks for them with the is.na() function and assigns a specific value to those rows.
df %>%
  mutate(grade = case_when(
    score <= 35 ~ "Fail",
    is.na(score) ~ NA,
    .default = "Pass"
  ))
# A tibble: 6 × 2
  score grade
  <dbl> <chr>
1    10 Fail
2    30 Fail
3    50 Pass
4    70 Pass
5    90 Pass
6    NA <NA>
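A variant of the NA check, sketched here under a couple of assumptions, writes the missing value as the typed constant NA_character_ and puts the is.na() condition first. To my knowledge, dplyr versions before 1.1.0 required the typed form when the other outcomes are character strings, while 1.1.0 accepts either. Putting the check first is a stylistic choice that makes the handling of missing scores easy to spot. The names df_na and res_na are hypothetical.

```r
library(dplyr)

# Same scores as above, with one missing value (df_na is a hypothetical name)
df_na <- tibble(score = c(seq(10, 100, by = 20), NA))

res_na <- df_na %>%
  mutate(grade = case_when(
    is.na(score) ~ NA_character_,  # typed NA matching the character outcomes
    score <= 35 ~ "Fail",
    .default = "Pass"
  ))

res_na
```

The result is the same as in the example above: the row with a missing score ends up with a missing grade instead of Pass.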