In this tutorial, we will learn how to randomly select rows from a dataframe in R using dplyr's slice_sample() function. slice_sample() can sample rows with or without replacement, and it supersedes the earlier dplyr function sample_n().
library(tidyverse)
packageVersion("dplyr")
## [1] ‘1.0.7’
We will use the storms dataset, which ships with the dplyr package, to illustrate random sampling with slice_sample().
storms %>% head()
## # A tibble: 6 × 13
##   name   year month   day  hour   lat  long status       category  wind pressure
##   <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>        <ord>    <int>    <int>
## 1 Amy    1975     6    27     0  27.5 -79   tropical de… -1          25     1013
## 2 Amy    1975     6    27     6  28.5 -79   tropical de… -1          25     1013
## 3 Amy    1975     6    27    12  29.5 -79   tropical de… -1          25     1013
## 4 Amy    1975     6    27    18  30.5 -79   tropical de… -1          25     1013
## 5 Amy    1975     6    28     0  31.5 -78.8 tropical de… -1          25     1012
## 6 Amy    1975     6    28     6  32.4 -78.7 tropical de… -1          25     1012
## # … with 2 more variables: ts_diameter <dbl>, hu_diameter <dbl>
Randomly select rows from dataframe in R
To randomly select a fixed number of rows, we use the slice_sample() function with the n argument, which specifies the number of rows to sample. In the example below we randomly sample 5 rows.
storms %>% slice_sample(n=5)
## # A tibble: 5 × 13
##   name   year month   day  hour   lat  long status       category  wind pressure
##   <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>        <ord>    <int>    <int>
## 1 Omar   2008    10    17    12  27.9 -55.7 hurricane    1           75      982
## 2 Emily  1999     8    25     0  12.1 -53.9 tropical st… 0           45     1005
## 3 Karl   1980    11    26    12  36.8 -42.5 hurricane    1           75      985
## 4 Karl   2004     9    18     6  14.5 -38   hurricane    2           85      975
## 5 Joan   1988    10    13    12  12.6 -56   tropical st… 0           40     1006
## # … with 2 more variables: ts_diameter <dbl>, hu_diameter <dbl>
Each call to slice_sample() draws a fresh random sample. We can verify this by running the same command again and noting that the rows differ from the previous run.
storms %>% slice_sample(n=5)
## # A tibble: 5 × 13
##   name      year month   day  hour   lat  long status    category  wind pressure
##   <chr>    <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>     <ord>    <int>    <int>
## 1 Cora      1978     8     8    18  14   -43.2 hurricane 1           65      990
## 2 Georges   1998     9    19    18  15.4 -53.5 hurricane 4          125      949
## 3 AL041991  1991     8    24    18  14.9 -23.9 tropical… -1          30     1009
## 4 Joan      1988    10    10    18   8.9 -42.2 tropical… -1          25     1010
## 5 Ingrid    2007     9    15    12  16.3 -53.3 tropical… 0           35     1005
## # … with 2 more variables: ts_diameter <dbl>, hu_diameter <dbl>
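Besides a fixed count, slice_sample() also accepts a prop argument that samples a fraction of the rows. The snippet below is a minimal sketch of that variant; the 1% fraction, the seed, and the variable name one_percent are arbitrary choices for illustration.

```r
library(dplyr)

set.seed(42)
# Sample roughly 1% of the rows instead of a fixed count
# (the fraction 0.01 is an arbitrary illustrative choice)
one_percent <- storms %>% slice_sample(prop = 0.01)

# Compare the size of the sample with the size of the full data
nrow(one_percent)
nrow(storms)
```

Use n when you need an exact number of rows and prop when the sample size should scale with the data.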
Reproduce a random sampling of rows from a dataframe
To reproduce the same random sample of rows, we need to seed the random number generator with the set.seed() function before sampling.
set.seed(42)
storms %>% slice_sample(n=5)
## # A tibble: 5 × 13
##   name    year month   day  hour   lat  long status      category  wind pressure
##   <chr>  <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>       <ord>    <int>    <int>
## 1 Cesar   1990     8     6    18  26.9 -46.6 tropical d… -1          30     1011
## 2 Isaac   2000     9    24    18  15.8 -37.8 hurricane   3          100      960
## 3 Nadine  2012     9    14     0  24.4 -53.4 tropical s… 0           60      989
## 4 Klaus   1984    11     7     0  17.4 -66.2 tropical s… 0           40     1000
## 5 Maria   2011     9    13    12  21.7 -67.7 tropical s… 0           45     1006
## # … with 2 more variables: ts_diameter <dbl>, hu_diameter <dbl>
By using the same seed, we will get exactly the same rows sampled again.
set.seed(42)
storms %>% slice_sample(n=5)
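Rather than comparing the printed rows by eye, we can check reproducibility programmatically. The following is a small sketch; the variable names first_draw and second_draw are made up for this illustration.

```r
library(dplyr)

# Draw the same sample twice, resetting the seed before each draw
set.seed(42)
first_draw <- storms %>% slice_sample(n = 5)

set.seed(42)
second_draw <- storms %>% slice_sample(n = 5)

# TRUE when both draws contain exactly the same rows in the same order
identical(first_draw, second_draw)
```

Note that the seed has to be set immediately before each draw. Calling slice_sample() twice after a single set.seed() gives two different samples, because the first call advances the random number generator.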
Randomly Sample Rows from a dataframe with Replacement
In the above examples of using slice_sample(), we sampled without replacement, meaning that a row cannot be selected again once it is already in the sample.
We can sample with replacement by passing the argument replace=TRUE to slice_sample(). In the example below, we use the top 6 rows of the storms data to illustrate sampling with replacement. With such a small dataset, we can easily see which rows were sampled more than once.
set.seed(42)
storms %>%
  head() %>%
  slice_sample(n=6, replace=TRUE)
We can see that the first, third and fourth rows are exactly the same, which is the result of sampling with replacement.
## # A tibble: 6 × 13
##   name   year month   day  hour   lat  long status       category  wind pressure
##   <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>        <ord>    <int>    <int>
## 1 Amy    1975     6    27     0  27.5 -79   tropical de… -1          25     1013
## 2 Amy    1975     6    28     0  31.5 -78.8 tropical de… -1          25     1012
## 3 Amy    1975     6    27     0  27.5 -79   tropical de… -1          25     1013
## 4 Amy    1975     6    27     0  27.5 -79   tropical de… -1          25     1013
## 5 Amy    1975     6    27     6  28.5 -79   tropical de… -1          25     1013
## 6 Amy    1975     6    27    18  30.5 -79   tropical de… -1          25     1013
## # … with 2 more variables: ts_diameter <dbl>, hu_diameter <dbl>
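Repeated rows can also be counted instead of spotted by eye. The sketch below uses base R's duplicated() to count them and contrasts the result with sampling without replacement; the variable names resampled and shuffled are made up for this illustration.

```r
library(dplyr)

# With replacement: the same row may be drawn more than once
set.seed(42)
resampled <- storms %>%
  head() %>%
  slice_sample(n = 6, replace = TRUE)

# Number of rows that repeat an earlier row in the sample
sum(duplicated(resampled))

# Without replacement: drawing 6 of 6 rows only shuffles them,
# so no row can appear twice
set.seed(42)
shuffled <- storms %>%
  head() %>%
  slice_sample(n = 6)

sum(duplicated(shuffled))
```

Because the six input rows are all distinct, any duplicate found in the first sample comes from the replacement step itself.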
Weighted random sampling with slice_sample()
By default, slice_sample() gives equal weight to each row while sampling. This means every row has the same probability of being sampled.
We can instead do weighted random sampling, where the values of a column determine how likely each row is to be sampled.
In the example below, we randomly sample rows from the dataframe weighted by the “wind” column using the weight_by argument. As a result, rows with higher wind values are more likely to be sampled than rows with lower wind values.
storms %>% slice_sample(n=6, weight_by = wind)
## # A tibble: 6 × 13
##   name     year month   day  hour   lat  long status     category  wind pressure
##   <chr>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>      <ord>    <int>    <int>
## 1 Bonnie   1998     8    29     6  39.2 -69.6 tropical … 0           45      999
## 2 Isaac    2000     9    28     6  23.8 -52   hurricane  3          105      955
## 3 Harvey   1999     9    20    18  27   -85.5 tropical … 0           50      998
## 4 Emily    2005     7    12    18  11   -52   tropical … 0           45     1004
## 5 Erin     1989     8    22    12  28.8 -44.5 hurricane  1           65      986
## 6 Allison  2001     6     5    21  28.9 -95.3 tropical … 0           45     1003
## # … with 2 more variables: ts_diameter <dbl>, hu_diameter <dbl>
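With only six rows it is hard to tell whether the weighting had any effect. One way to see it, sketched below, is to draw larger samples with and without weights and compare their average wind speed. The sample size of 1000, the seed, and the variable names are arbitrary choices for this illustration.

```r
library(dplyr)

set.seed(42)
# Larger samples make the effect of the weights visible in the averages
uniform_sample  <- storms %>% slice_sample(n = 1000)
weighted_sample <- storms %>% slice_sample(n = 1000, weight_by = wind)

# The weighted sample should tend to have the higher mean wind speed
mean(uniform_sample$wind)
mean(weighted_sample$wind)
```

The exact averages depend on the seed and on the version of the storms data, but the weighted sample should show a clearly higher mean because high-wind rows are favored.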