In this post, we will learn how to drop unused levels of a factor variable in R. After some data munging, we may end up with a factor variable that still carries levels no longer present in the data. Such unused factor levels can sometimes create issues while analyzing the data.
In this tutorial, we will show how to drop unused levels of a factor variable using two approaches: the droplevels() function available in base R, and fct_drop() from the forcats package in the tidyverse.
Let us load the packages needed.
library(tidyverse)
library(palmerpenguins)
We will use Palmer Penguins data to show how to drop levels of a factor variable.
penguins |> head()
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
Levels of a factor variable
In R, the factor data type is a useful way to represent categorical data; its values are drawn from a fixed set of levels, or categories. We can use the levels() function to list the distinct levels of a factor variable. In the example below, the factor variable species has three levels.
levels(penguins$species)
[1] "Adelie"    "Chinstrap" "Gentoo"
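To count the levels rather than list them, base R also provides nlevels(). Here is a minimal sketch on a small made-up factor (the sizes vector is toy data for illustration, not part of the penguins dataset):

```r
# Toy factor, made up for illustration
sizes <- factor(c("small", "large", "small", "medium"))

# The distinct levels; by default they are the sorted unique values
levels(sizes)

# The number of levels
nlevels(sizes)
```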
One thing to notice about a factor variable is that even after removing all the values belonging to a specific level, the factor will still include that level as one of its levels.
Let us see an example by filtering out one of the levels of a factor variable. In the example below, we remove the data for the species “Gentoo” to create a new dataframe.
df <- penguins |> filter(species != "Gentoo")
If we check the levels of the species variable in the new dataframe, it will still include the removed level. This may cause problems while analyzing the new dataframe, for example an empty category showing up in summary tables.
levels(df$species)
[1] "Adelie"    "Chinstrap" "Gentoo"
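One place where a leftover level becomes visible is in frequency tables. The sketch below uses a small made-up factor (toy data, not the penguins dataset) and base R subsetting, which keeps unused levels just as filter() does in the example above:

```r
# Toy factor with three levels, made up for illustration
f <- factor(c("a", "b", "a", "c"))

# Subsetting keeps all original levels, even those no longer present
f_sub <- f[f != "c"]

# Still lists "c" as a level
levels(f_sub)

# "c" appears in the table with a count of 0
table(f_sub)
```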
droplevels() to remove unused levels of a factor variable
One way to remove the unused levels of a factor variable is to use the droplevels() function available in base R.
We apply droplevels() to the factor and assign the result back to the same column, replacing the original factor variable.
df$species <- droplevels(df$species)
Now, if we check the levels of the factor variable, we will see that the unused level has been removed.
levels(df$species)
[1] "Adelie"    "Chinstrap"
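droplevels() also has a method for data frames, which drops unused levels from every factor column in one call. A minimal sketch, assuming a small made-up data frame (toy and its columns are hypothetical and used only for illustration):

```r
# Toy data frame with two factor columns, made up for illustration
toy <- data.frame(
  group = factor(c("x", "y", "z", "x")),
  site  = factor(c("north", "south", "south", "east"))
)

# Keep only rows from group "x"; both factors now carry unused levels
toy_sub <- toy[toy$group == "x", ]

# droplevels() on a data frame drops unused levels in every factor column
toy_clean <- droplevels(toy_sub)

levels(toy_clean$group)
levels(toy_clean$site)
```

This can be convenient when several factor columns are affected by the same filtering step, since each column does not need to be reassigned separately.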
forcats’ fct_drop() to remove unused levels of a factor variable
Another way to remove unused levels of a factor variable is to use the fct_drop() function from the forcats package, which is part of the tidyverse.
Let us filter out the rows corresponding to one of the levels as before.
df2 <- penguins |> filter(species != "Chinstrap")
We can see that the species variable in the new dataframe still has the unused level.
levels(df2$species)
[1] "Adelie"    "Chinstrap" "Gentoo"
We can use fct_drop() to drop the unused levels of the factor variable, and update the variable in the dataframe using the mutate() function.
df2 <- df2 |> mutate(species = fct_drop(species))
If we check the levels now, we will see only the levels that are actually used in the new dataframe, as we wanted.
levels(df2$species)
[1] "Adelie" "Gentoo"
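fct_drop() also accepts an only argument, which restricts the drop to the named levels and leaves other unused levels in place. A minimal sketch on a made-up factor (toy data, not the penguins dataset):

```r
library(forcats)

# Toy factor with two unused levels ("c" and "d"), made up for illustration
f <- factor(c("a", "b"), levels = c("a", "b", "c", "d"))

# Drops both unused levels
levels(fct_drop(f))

# Drops "c" only; "d" remains as an unused level
levels(fct_drop(f, only = "c"))
```

This may help when some empty levels are meaningful and should be kept, for example categories that are valid but happen to have no observations.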