In this tutorial, we will learn how to use the unite() function in the tidyr package to combine multiple columns into a single column. By combining, we mean concatenating the values of two or more columns, separated by a delimiter such as an underscore. We will start with three examples of combining two columns into one. Then we will show two examples of combining more than two columns into a single column.
Let us first load the packages needed.
library(tidyverse)
library(palmerpenguins)
packageVersion("tidyr")
## [1] '1.2.0'
We will use the Palmer penguins data set to show how to use the unite() function.
penguins %>%
  head()
## # A tibble: 6 × 8
##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Adelie  Torge…           39.1          18.7              181        3750 male
## 2 Adelie  Torge…           39.5          17.4              186        3800 fema…
## 3 Adelie  Torge…           40.3          18                195        3250 fema…
## 4 Adelie  Torge…           NA            NA                 NA          NA <NA>
## 5 Adelie  Torge…           36.7          19.3              193        3450 fema…
## 6 Adelie  Torge…           39.3          20.6              190        3650 male
## # … with 1 more variable: year <int>
Instead of using all the columns in the penguins data set, we select four columns to illustrate how tidyr’s unite() function combines multiple columns into a single column.
Below we create a new dataframe with just four columns.
df <- penguins %>%
  select(species, island, sex, body_mass_g)
df %>%
  head()
## # A tibble: 6 × 4
##   species island    sex    body_mass_g
##   <fct>   <fct>     <fct>        <int>
## 1 Adelie  Torgersen male          3750
## 2 Adelie  Torgersen female        3800
## 3 Adelie  Torgersen female        3250
## 4 Adelie  Torgersen <NA>            NA
## 5 Adelie  Torgersen female        3450
## 6 Adelie  Torgersen male          3650
Combine two columns into one column with tidyr’s unite()
To combine two columns into a single column, we first specify the name of the new column that will hold the combined values, and then the names of the columns to be combined.
In the example below, we combine the species and island columns and name the combined column ‘species_island’.
df %>%
  unite(col="species_island", c(species, island)) %>%
  head()
We get a new dataframe with the new combined column. Note that by default unite() uses an underscore “_” to join the values of the columns. Also by default, unite() removes the original two columns that we combined.
## # A tibble: 6 × 3
##   species_island   sex    body_mass_g
##   <chr>            <fct>        <int>
## 1 Adelie_Torgersen male          3750
## 2 Adelie_Torgersen female        3800
## 3 Adelie_Torgersen female        3250
## 4 Adelie_Torgersen <NA>            NA
## 5 Adelie_Torgersen female        3450
## 6 Adelie_Torgersen male          3650
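As a minimal, self-contained sketch of these defaults, the snippet below runs unite() on a small made-up tibble. The `toy` data and its values are hypothetical and are not taken from the penguins data set. It checks that the values are joined with an underscore, that the two source columns are dropped, and that the new column is a character vector.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data, used only for illustration
toy <- tibble(
  species     = c("Adelie", "Gentoo"),
  island      = c("Torgersen", "Biscoe"),
  body_mass_g = c(3750L, 5000L)
)

# Default call: sep = "_" and remove = TRUE
united <- toy %>%
  unite(col = "species_island", c(species, island))

names(united)                 # "species_island" "body_mass_g"
united$species_island         # "Adelie_Torgersen" "Gentoo_Biscoe"
class(united$species_island)  # "character"
```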
Keep original columns while combining two columns into a single column
We can keep the original two columns by passing the argument ‘remove=FALSE’ to unite(). The combined column is then added while the original columns stay in the dataframe.
df %>%
  unite(col="species_island", c(species, island), remove=FALSE) %>%
  head()
## # A tibble: 6 × 5
##   species_island   species island    sex    body_mass_g
##   <chr>            <fct>   <fct>     <fct>        <int>
## 1 Adelie_Torgersen Adelie  Torgersen male          3750
## 2 Adelie_Torgersen Adelie  Torgersen female        3800
## 3 Adelie_Torgersen Adelie  Torgersen female        3250
## 4 Adelie_Torgersen Adelie  Torgersen <NA>            NA
## 5 Adelie_Torgersen Adelie  Torgersen female        3450
## 6 Adelie_Torgersen Adelie  Torgersen male          3650
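To see the effect of remove side by side, here is a small sketch that compares the column names returned with and without remove=FALSE. It assumes a made-up two-column tibble (`toy` is hypothetical, not part of the tutorial's data).

```r
library(tidyr)
library(tibble)

# Hypothetical toy data, used only for illustration
toy <- tibble(
  species = c("Adelie", "Gentoo"),
  island  = c("Torgersen", "Biscoe")
)

# Default: the source columns are dropped
dropped <- toy %>%
  unite(col = "species_island", c(species, island))

# remove = FALSE: the source columns are kept next to the new column
kept <- toy %>%
  unite(col = "species_island", c(species, island), remove = FALSE)

names(dropped)  # "species_island"
names(kept)     # "species_island" "species" "island"
```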
Combine two columns into a single column using a specific delimiter
We can also specify the delimiter used to combine the two columns instead of the default underscore. In the example below we combine two columns with the unite() function, using the delimiter “--” set by the sep="--" argument.
df %>%
  unite(col="species_island", c(species, island), sep="--") %>%
  head()
## # A tibble: 6 × 3
##   species_island    sex    body_mass_g
##   <chr>             <fct>        <int>
## 1 Adelie--Torgersen male          3750
## 2 Adelie--Torgersen female        3800
## 3 Adelie--Torgersen female        3250
## 4 Adelie--Torgersen <NA>            NA
## 5 Adelie--Torgersen female        3450
## 6 Adelie--Torgersen male          3650
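The sep argument accepts any string, including a space or an empty string. The sketch below tries a few separators on a made-up tibble (`toy` is hypothetical). For these character columns without missing values, it also compares the result with base R's paste(), which is offered here only as a sanity check.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data, used only for illustration
toy <- tibble(
  species = c("Adelie", "Gentoo"),
  island  = c("Torgersen", "Biscoe")
)

dashed <- unite(toy, col = "species_island", c(species, island), sep = "--")
spaced <- unite(toy, col = "species_island", c(species, island), sep = " ")
glued  <- unite(toy, col = "species_island", c(species, island), sep = "")

dashed$species_island  # "Adelie--Torgersen" "Gentoo--Biscoe"
spaced$species_island  # "Adelie Torgersen" "Gentoo Biscoe"
glued$species_island   # "AdelieTorgersen" "GentooBiscoe"

# Sanity check against base paste() for this NA-free character data
identical(dashed$species_island, paste(toy$species, toy$island, sep = "--"))
```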
Combine more than two columns into a single column
To combine more than two columns, here three, we just need to select the columns using tidy selection. In the example below we give the three column names as a vector with c().
df %>%
  unite(col="species_island_sex", c(species, island, sex))
## # A tibble: 344 × 2
##    species_island_sex      body_mass_g
##    <chr>                         <int>
##  1 Adelie_Torgersen_male          3750
##  2 Adelie_Torgersen_female        3800
##  3 Adelie_Torgersen_female        3250
##  4 Adelie_Torgersen_NA              NA
##  5 Adelie_Torgersen_female        3450
##  6 Adelie_Torgersen_male          3650
##  7 Adelie_Torgersen_female        3625
##  8 Adelie_Torgersen_male          4675
##  9 Adelie_Torgersen_NA            3475
## 10 Adelie_Torgersen_NA            4250
## # … with 334 more rows
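Because unite() selects columns with tidy selection, a vector of names is only one way to pick them. The sketch below, on a made-up tibble (`toy` and `cols` are hypothetical names), shows two alternatives that should give the same combined column: a column range with `:` and a character vector of names wrapped in tidyselect's all_of().

```r
library(tidyr)
library(tibble)

# Hypothetical toy data, used only for illustration
toy <- tibble(
  species     = c("Adelie", "Gentoo"),
  island      = c("Torgersen", "Biscoe"),
  sex         = c("male", "female"),
  body_mass_g = c(3750L, 5000L)
)

# 1. Bare column names in a vector, as in the example above
by_vector <- unite(toy, col = "species_island_sex", c(species, island, sex))

# 2. A range of adjacent columns
by_range <- unite(toy, col = "species_island_sex", species:sex)

# 3. Column names stored as strings
cols <- c("species", "island", "sex")
by_names <- unite(toy, col = "species_island_sex", tidyselect::all_of(cols))

by_vector$species_island_sex  # "Adelie_Torgersen_male" "Gentoo_Biscoe_female"
identical(by_vector$species_island_sex, by_range$species_island_sex)
identical(by_vector$species_island_sex, by_names$species_island_sex)
```

The range form depends on the columns being next to each other in the dataframe, so the explicit vector is the safer choice when the column order may change.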
Dealing with missing values while combining multiple columns into a single column
If we look at the output of the previous example, where we combined three columns into one, we can see that NAs in the third column are represented as the string “NA” in the combined column. By default, the unite() function does not remove NAs.
We can drop missing values from the columns we are combining by specifying na.rm=TRUE. Here is an example of removing NAs.
Note that the fourth value of the combined column no longer contains NA.
df %>%
  unite(col="species_island_sex", c(species, island, sex), na.rm=TRUE) %>%
  head()
## # A tibble: 6 × 2
##   species_island_sex      body_mass_g
##   <chr>                         <int>
## 1 Adelie_Torgersen_male          3750
## 2 Adelie_Torgersen_female        3800
## 3 Adelie_Torgersen_female        3250
## 4 Adelie_Torgersen                 NA
## 5 Adelie_Torgersen_female        3450
## 6 Adelie_Torgersen_male          3650
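The difference between the two settings is easiest to see on a tiny example. The sketch below uses a made-up tibble with character columns (`toy` is hypothetical) in which one value of sex is missing, and runs unite() with and without na.rm=TRUE.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data with one missing value in `sex`
toy <- tibble(
  species = c("Adelie", "Adelie"),
  island  = c("Torgersen", "Torgersen"),
  sex     = c("male", NA)
)

# Default (na.rm = FALSE): the missing value becomes the string "NA"
with_na <- unite(toy, col = "species_island_sex", c(species, island, sex))

# na.rm = TRUE: the missing value and its separator are left out
without_na <- unite(toy, col = "species_island_sex", c(species, island, sex),
                    na.rm = TRUE)

with_na$species_island_sex     # "Adelie_Torgersen_male" "Adelie_Torgersen_NA"
without_na$species_island_sex  # "Adelie_Torgersen_male" "Adelie_Torgersen"
```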