Factors are the type of variable that R uses to store categorical information. For example, we can save sex as a factor variable with two values, male and female. In this post we will learn about factor variables in R and see multiple ways to find the levels of a factor variable. First we will use base R functions to learn about the levels, and then we will use tidyverse functions to do the same.
First, let us get started by loading the necessary packages. We use the Palmer penguins dataset to understand factor variables and their levels in R.
library(tidyverse)
library(palmerpenguins)
A quick look at the penguins data shows that three of its columns (species, island and sex) are saved as factor variables.
penguins %>% head()
## # A tibble: 6 × 8
##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Adelie  Torge…           39.1          18.7              181        3750 male
## 2 Adelie  Torge…           39.5          17.4              186        3800 fema…
## 3 Adelie  Torge…           40.3          18                195        3250 fema…
## 4 Adelie  Torge…           NA            NA                 NA          NA <NA>
## 5 Adelie  Torge…           36.7          19.3              193        3450 fema…
## 6 Adelie  Torge…           39.3          20.6              190        3650 male
## # … with 1 more variable: year <int>
class() function to find the class of a variable
The class() function in base R tells us what type of object a given variable is. In this example, if we use class() on one of the factor variables, we get “factor”.
class(penguins$sex)
## [1] "factor"
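If we apply class() to a numeric variable stored as a double, such as bill_length_mm, we get “numeric” as shown below. (A true integer column, such as body_mass_g, would give “integer” instead.)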
class(penguins$bill_length_mm)
## [1] "numeric"
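To make the difference between these classes concrete, here is a minimal sketch that uses only base R. The three toy vectors (sex_toy, mass_toy, bill_toy) are hypothetical stand-ins made up for illustration, not columns taken from the penguins data.

```r
# Toy vectors (hypothetical; they only mimic the column types in penguins)
sex_toy  <- factor(c("male", "female", "female"))
mass_toy <- c(3750L, 3800L, 3250L)  # the L suffix stores the values as integers
bill_toy <- c(39.1, 39.5, 40.3)     # plain decimals are stored as doubles

class_sex  <- class(sex_toy)   # "factor"
class_mass <- class(mass_toy)  # "integer"
class_bill <- class(bill_toy)  # "numeric"

print(c(class_sex, class_mass, class_bill))
```

Note that class() only names the type. It says nothing about which categories a factor contains.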
str() function to get levels of a factor variable
Although class() tells us whether a variable is a factor or some other type, it does not tell us the levels of a factor. To get the levels, we can use the str() function. str() in base R reports that the variable is a factor, lists its levels, and shows the first few values as the integer codes that R stores internally for each level.
str(penguins$sex)
## Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
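The integers that str() prints after the levels are the codes R stores for each value. As a small sketch of how codes and levels relate, the toy factor below (sex_toy, a made-up vector that mirrors the first six values in the str() output) is converted to its codes and then mapped back to labels.

```r
# Hypothetical toy factor mirroring the first six values shown by str()
sex_toy <- factor(c("male", "female", "female", NA, "female", "male"))

# Levels are sorted alphabetically by default: "female" is 1, "male" is 2
codes <- as.integer(sex_toy)

# Indexing the levels by the codes recovers the original labels
labels <- levels(sex_toy)[codes]

print(codes)
print(labels)
```

Missing values have no code of their own, so they stay NA in both the codes and the recovered labels.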
levels() function to find the levels of a factor variable in R
The levels() function applied to a factor variable returns its levels as a character vector. In the example below, levels() on the sex variable of the dataframe gives us its two levels, female and male.
levels(penguins$sex)
## [1] "female" "male"
We can also pipe the variable of interest into levels() to get the levels of a factor.
penguins$sex %>% levels()
## [1] "female" "male"
In contrast, when we apply levels() to a numerical variable, we get NULL because the variable has no levels.
levels(penguins$bill_depth_mm)
## NULL
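One detail worth keeping in mind is that levels() reports every level a factor was defined with, whether or not that level occurs in the data. The sketch below illustrates this with a hypothetical toy factor, island_toy, whose levels are declared explicitly. It uses only base R, including nlevels() to count the levels.

```r
# Hypothetical toy factor: "Dream" is declared as a level but never occurs
island_toy <- factor(c("Torgersen", "Biscoe", "Torgersen"),
                     levels = c("Biscoe", "Dream", "Torgersen"))

all_levels  <- levels(island_toy)                # all three declared levels
n_levels    <- nlevels(island_toy)               # number of declared levels
seen_values <- unique(as.character(island_toy))  # only values present in the data
no_levels   <- levels(c(18.7, 17.4))             # NULL for a non-factor

print(all_levels)
print(n_levels)
print(seen_values)
print(is.null(no_levels))
```

So if the question is "which categories can this variable take", levels() is the right tool. If the question is "which categories actually appear", unique() on the values answers that instead.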
Finding levels of factor with forcats::fct_count()
We can also get the levels of a factor variable using the tidyverse, in a slightly indirect way. The R package forcats, part of the tidyverse suite of R packages, has two functions that are handy for understanding factors and their levels.
The first of these is fct_count(). This function counts the number of items for each level of the factor variable. In addition to the levels, it also gives us the count of NAs, that is, missing values.
penguins %>% pull(sex) %>% forcats::fct_count()
## # A tibble: 3 × 2
##   f          n
##   <fct>  <int>
## 1 female   165
## 2 male     168
## 3 <NA>      11
fct_count() expects its input to be a factor (or a character vector). If we give it a numerical variable instead, we get the following error.
penguins %>% pull(bill_length_mm) %>% forcats::fct_count()
## Error: `f` must be a factor (or character vector).
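For comparison, a similar per-level count is available in base R. The sketch below uses table() with useNA = "ifany" on a hypothetical toy factor, sex_toy. This is a base R stand-in rather than forcats, and it returns a table object instead of a tibble.

```r
# Hypothetical toy factor with one missing value
sex_toy <- factor(c("male", "female", "female", NA))

# useNA = "ifany" adds an NA entry only when missing values are present
counts <- table(sex_toy, useNA = "ifany")

print(counts)
```

The names of the result are the factor levels, so names(counts) is one more indirect way of reading off the levels.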
Finding levels of factor with forcats::fct_unique()
fct_unique() is another function in the forcats package that can give us the factor levels. It returns a factor containing each level exactly once. In the example below we get female and male as the levels.
penguins %>% pull(sex) %>% forcats::fct_unique()
## [1] female male
## Levels: female male
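When fct_unique() is applied to a numerical variable, it does not throw an error, at least in the forcats version used for this post. Instead, it quietly returns an empty factor with no levels.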
penguins %>% pull(bill_length_mm) %>% forcats::fct_unique()
## factor(0)
## Levels:
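Because an empty result is easy to overlook, it can help to check the input with is.factor() before asking for levels. The helper below, safe_levels(), is a hypothetical function written for this illustration (it is not part of base R or forcats). It is a minimal sketch that stops with an error when the input is not a factor.

```r
# Hypothetical helper: return the levels, or fail loudly for non-factors
safe_levels <- function(x) {
  if (!is.factor(x)) {
    stop("`x` must be a factor, not an object of class ", class(x)[1])
  }
  levels(x)
}

# Works on a factor
factor_result <- safe_levels(factor(c("male", "female")))

# Raises an error on a numeric vector; tryCatch() captures it for inspection
numeric_result <- tryCatch(safe_levels(c(39.1, 39.5)), error = function(e) e)

print(factor_result)
print(conditionMessage(numeric_result))
```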