In this tutorial, we will learn how to select one or more columns (variables) from a data frame in R. We first select the columns of interest by name with dplyr’s select() function, and then do the same with base R.
Let us get started by loading the tidyverse suite of R packages, which includes dplyr.
library(tidyverse)
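We will use the iris dataset that ships with R to illustrate selecting one or more columns using tidyverse and base R. Note that iris is a plain data.frame by default; the outputs below print as tibbles, as they would if iris had first been converted with as_tibble().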
iris %>% head()
## # A tibble: 6 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>
## 1          5.1         3.5          1.4         0.2 setosa
## 2          4.9         3            1.4         0.2 setosa
## 3          4.7         3.2          1.3         0.2 setosa
## 4          4.6         3.1          1.5         0.2 setosa
## 5          5           3.6          1.4         0.2 setosa
## 6          5.4         3.9          1.7         0.4 setosa
Select a single column using dplyr select()
select() is one of the main functions (verbs) of the dplyr package, and it lets us select one or more columns easily in many situations.
Here we show three equivalent ways to use select() to select a column from a data frame. First, we use the %>% pipe operator from magrittr, which is loaded with tidyverse.
iris %>% select(Sepal.Length)
## # A tibble: 150 × 1
##    Sepal.Length
##           <dbl>
##  1          5.1
##  2          4.9
##  3          4.7
##  4          4.6
##  5          5
##  6          5.4
##  7          4.6
##  8          5
##  9          4.4
## 10          4.9
## # … with 140 more rows
Starting with R version 4.1.0, base R has its own native pipe operator, “|>”. Here is the same selection written with the native pipe.
iris |> select(Sepal.Length)
## # A tibble: 150 × 1
##    Sepal.Length
##           <dbl>
##  1          5.1
##  2          4.9
##  3          4.7
##  4          4.6
##  5          5
##  6          5.4
##  7          4.6
##  8          5
##  9          4.4
## 10          4.9
## # … with 140 more rows
We can also call select() without any pipe operator, as shown below, and get the same result.
select(iris, Sepal.Length)
## # A tibble: 150 × 1
##    Sepal.Length
##           <dbl>
##  1          5.1
##  2          4.9
##  3          4.7
##  4          4.6
##  5          5
##  6          5.4
##  7          4.6
##  8          5
##  9          4.4
## 10          4.9
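The equivalence of the three forms above can be checked directly. This is a minimal sketch, assuming dplyr is installed and R is at least version 4.1 (needed for the native pipe); the variable names with_magrittr, with_native and without_pipe are only illustrative.

```r
library(dplyr)

# Three ways of writing the same single-column selection
with_magrittr <- iris %>% select(Sepal.Length)
with_native   <- iris |> select(Sepal.Length)   # native pipe, needs R >= 4.1
without_pipe  <- select(iris, Sepal.Length)

# Each comparison checks that two forms return the same object
identical(with_magrittr, with_native)
identical(with_magrittr, without_pipe)
```

Both pipes simply pass iris as the first argument of select(), which is why the results do not differ.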
How to select a column using base R
If you want to select a column from a data frame using base R, one way is to use the $ operator with the column name. However, this returns the column values as a vector instead of a data frame.
iris$Sepal.Length
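To see that $ really returns a vector, we can inspect the result. The sketch below uses only base R; the name sepal_length is an illustrative choice.

```r
# Extract one column with `$`; the result is a plain numeric vector
sepal_length <- iris$Sepal.Length

is.data.frame(sepal_length)   # FALSE: not a data frame
is.numeric(sepal_length)      # TRUE
length(sepal_length)          # 150, one value per row of iris

# `[[` with the column name extracts the same vector
identical(sepal_length, iris[["Sepal.Length"]])
```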
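If we want the column as a data frame, we can subset with square bracket notation and the column name, as shown below. The output here is a one-column tibble because tibbles never drop to a vector; on a plain data.frame the same expression returns a vector unless we add drop = FALSE or use iris["Sepal.Length"].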
iris[,"Sepal.Length"]
## # A tibble: 150 × 1
##    Sepal.Length
##           <dbl>
##  1          5.1
##  2          4.9
##  3          4.7
##  4          4.6
##  5          5
##  6          5.4
##  7          4.6
##  8          5
##  9          4.4
## 10          4.9
## # … with 140 more rows
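Whether square brackets return a data frame depends on the class of the object. The sketch below uses only base R and a plain data.frame copy of iris (the name df is illustrative) to show the default behaviour and two ways to keep the data frame structure.

```r
# Work on a plain data.frame (not a tibble)
df <- as.data.frame(iris)

# A single column selected with [ , ] is dropped to a vector by default
class(df[, "Sepal.Length"])                  # "numeric"

# Two ways to keep a one-column data frame instead
class(df[, "Sepal.Length", drop = FALSE])    # "data.frame"
class(df["Sepal.Length"])                    # "data.frame"
```

With a tibble, the first form already returns a one-column tibble, so drop = FALSE is only needed for plain data frames.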
How to Select Multiple Columns using select()
To select multiple columns from a data frame with dplyr’s select() function, we pass the column names as arguments. Here are three equivalent ways to do so.
First, we use the %>% pipe operator from magrittr, loaded with tidyverse.
iris %>% select(Sepal.Length, Species)
## # A tibble: 150 × 2
##    Sepal.Length Species
##           <dbl> <fct>
##  1          5.1 setosa
##  2          4.9 setosa
##  3          4.7 setosa
##  4          4.6 setosa
##  5          5   setosa
##  6          5.4 setosa
##  7          4.6 setosa
##  8          5   setosa
##  9          4.4 setosa
## 10          4.9 setosa
## # … with 140 more rows
Here we select multiple columns using the native pipe operator |>, available in base R from version 4.1.0.
iris |> select(Sepal.Length, Species)
## # A tibble: 150 × 2
##    Sepal.Length Species
##           <dbl> <fct>
##  1          5.1 setosa
##  2          4.9 setosa
##  3          4.7 setosa
##  4          4.6 setosa
##  5          5   setosa
##  6          5.4 setosa
##  7          4.6 setosa
##  8          5   setosa
##  9          4.4 setosa
## 10          4.9 setosa
## # … with 140 more rows
And here is how to select multiple columns with dplyr’s select() function without using a pipe operator.
select(iris, Sepal.Length, Species)
## # A tibble: 150 × 2
##    Sepal.Length Species
##           <dbl> <fct>
##  1          5.1 setosa
##  2          4.9 setosa
##  3          4.7 setosa
##  4          4.6 setosa
##  5          5   setosa
##  6          5.4 setosa
##  7          4.6 setosa
##  8          5   setosa
##  9          4.4 setosa
## 10          4.9 setosa
## # … with 140 more rows
How to select multiple columns from a dataframe in base R
To select multiple columns from a data frame using base R, we use square bracket notation and give the names of the columns of interest as a character vector.
iris[,c("Sepal.Length", "Species")]
## # A tibble: 150 × 2
##    Sepal.Length Species
##           <dbl> <fct>
##  1          5.1 setosa
##  2          4.9 setosa
##  3          4.7 setosa
##  4          4.6 setosa
##  5          5   setosa
##  6          5.4 setosa
##  7          4.6 setosa
##  8          5   setosa
##  9          4.4 setosa
## 10          4.9 setosa
## # … with 140 more rows
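Because two or more columns are selected, the result stays a data frame even for a plain data.frame. The base R sketch below checks this and shows that column positions can be used in place of names; the variable names selected and by_position are illustrative, and the positions 1 and 5 assume the default column order of iris.

```r
# Select two columns by name; the result is still a data frame
selected <- iris[, c("Sepal.Length", "Species")]
is.data.frame(selected)
dim(selected)            # 150 rows, 2 columns

# The same columns selected by position (1st and 5th column of iris)
by_position <- iris[, c(1, 5)]
identical(selected, by_position)
```

Selecting by name is usually the safer choice, since positions change when columns are added or reordered.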
We can also save the columns of interest in a vector variable and use the variable to select the columns in base R.
columns_of_interest <- c("Sepal.Length", "Species")
iris[, columns_of_interest]
## # A tibble: 150 × 2
##    Sepal.Length Species
##           <dbl> <fct>
##  1          5.1 setosa
##  2          4.9 setosa
##  3          4.7 setosa
##  4          4.6 setosa
##  5          5   setosa
##  6          5.4 setosa
##  7          4.6 setosa
##  8          5   setosa
##  9          4.4 setosa
## 10          4.9 setosa
## # … with 140 more rows
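The same stored vector of names can also be used with dplyr. This is a sketch of one way to do it, assuming dplyr is installed; it uses the all_of() selection helper, which is not covered in the tutorial above, and the names from_dplyr and from_base are illustrative.

```r
library(dplyr)

# Column names stored in a character vector
columns_of_interest <- c("Sepal.Length", "Species")

# dplyr: wrap the vector in all_of() inside select()
from_dplyr <- iris %>% select(all_of(columns_of_interest))

# base R: pass the vector directly inside square brackets
from_base <- iris[, columns_of_interest]

# Both approaches return the same columns in the same order
identical(names(from_dplyr), names(from_base))
```

all_of() raises an error if any name in the vector is missing from the data frame, which helps catch typos in column names early.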