How to randomly select rows from a dataframe in R

In this tutorial, we will learn how to randomly select rows from a dataframe using dplyr’s slice_sample() function in R. slice_sample() is the current way to randomly select rows, either with or without replacement, and it supersedes the earlier dplyr functions sample_n() and sample_frac().

dplyr::slice_sample() to randomly select rows from a dataframe

library(tidyverse)
packageVersion("dplyr")

[1] ‘1.0.7’

We will use the storms dataset, which ships with the dplyr package, to illustrate random sampling with slice_sample().

storms %>%  
  head()

## # A tibble: 6 × 13
##   name   year month   day  hour   lat  long status       category  wind pressure
##   <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>        <ord>    <int>    <int>
## 1 Amy    1975     6    27     0  27.5 -79   tropical de… -1          25     1013
## 2 Amy    1975     6    27     6  28.5 -79   tropical de… -1          25     1013
## 3 Amy    1975     6    27    12  29.5 -79   tropical de… -1          25     1013
## 4 Amy    1975     6    27    18  30.5 -79   tropical de… -1          25     1013
## 5 Amy    1975     6    28     0  31.5 -78.8 tropical de… -1          25     1012
## 6 Amy    1975     6    28     6  32.4 -78.7 tropical de… -1          25     1012
## # … with 2 more variables: ts_diameter <dbl>, hu_diameter <dbl>

Randomly select rows from dataframe in R

To randomly select a fixed number of rows, we use the slice_sample() function with the n argument to specify the number of rows we would like to sample. In the example below, we randomly sample 5 rows.

storms %>% 
  slice_sample(n=5)
## # A tibble: 5 × 13
##   name   year month   day  hour   lat  long status       category  wind pressure
##   <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>        <ord>    <int>    <int>
## 1 Omar   2008    10    17    12  27.9 -55.7 hurricane    1           75      982
## 2 Emily  1999     8    25     0  12.1 -53.9 tropical st… 0           45     1005
## 3 Karl   1980    11    26    12  36.8 -42.5 hurricane    1           75      985
## 4 Karl   2004     9    18     6  14.5 -38   hurricane    2           85      975
## 5 Joan   1988    10    13    12  12.6 -56   tropical st… 0           40     1006
## # … with 2 more variables: ts_diameter <dbl>, hu_diameter <dbl>

We can verify that slice_sample() does not return the same sample every time by running it again. Note that the rows are different from the previous run.

storms %>% 
  slice_sample(n=5)
## # A tibble: 5 × 13
##   name      year month   day  hour   lat  long status    category  wind pressure
##   <chr>    <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>     <ord>    <int>    <int>
## 1 Cora      1978     8     8    18  14   -43.2 hurricane 1           65      990
## 2 Georges   1998     9    19    18  15.4 -53.5 hurricane 4          125      949
## 3 AL041991  1991     8    24    18  14.9 -23.9 tropical… -1          30     1009
## 4 Joan      1988    10    10    18   8.9 -42.2 tropical… -1          25     1010
## 5 Ingrid    2007     9    15    12  16.3 -53.3 tropical… 0           35     1005
## # … with 2 more variables: ts_diameter <dbl>, hu_diameter <dbl>
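
Besides a fixed number of rows, slice_sample() can also sample a fraction of the rows through its prop argument. The following is a minimal sketch; the 1% fraction and the name sampled_fraction are arbitrary choices for illustration.

```r
library(dplyr)

# Sample roughly 1% of the rows instead of a fixed number of rows.
# prop = 0.01 is an arbitrary fraction chosen for illustration.
sampled_fraction <- storms %>%
  slice_sample(prop = 0.01)

# Compare the size of the sample with the size of the full dataset
nrow(sampled_fraction)
nrow(storms)
```

The number of rows returned is the fraction of the dataset size rounded down to a whole number.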

Reproduce a random sampling of rows from a dataframe

If we want to reproduce the same random sample of rows, we need to set the seed of the random number generator with the set.seed() function before sampling.

set.seed(42)
storms %>% 
  slice_sample(n=5)
## # A tibble: 5 × 13
##   name    year month   day  hour   lat  long status      category  wind pressure
##   <chr>  <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>       <ord>    <int>    <int>
## 1 Cesar   1990     8     6    18  26.9 -46.6 tropical d… -1          30     1011
## 2 Isaac   2000     9    24    18  15.8 -37.8 hurricane   3          100      960
## 3 Nadine  2012     9    14     0  24.4 -53.4 tropical s… 0           60      989
## 4 Klaus   1984    11     7     0  17.4 -66.2 tropical s… 0           40     1000
## 5 Maria   2011     9    13    12  21.7 -67.7 tropical s… 0           45     1006
## # … with 2 more variables: ts_diameter <dbl>, hu_diameter <dbl>

By using the same seed, we will get exactly the same rows sampled again.

set.seed(42)
storms %>% 
  slice_sample(n=5)
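
One way to check this without comparing the printed tables by eye is to store both samples and compare them with identical(). This is a small sketch; the names first_draw and second_draw are only for illustration.

```r
library(dplyr)

# Draw a sample after setting the seed
set.seed(42)
first_draw <- storms %>%
  slice_sample(n = 5)

# Reset the seed to the same value and draw again
set.seed(42)
second_draw <- storms %>%
  slice_sample(n = 5)

# TRUE when both draws contain exactly the same rows in the same order
identical(first_draw, second_draw)
```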

Randomly Sample Rows from a dataframe with Replacement

In the above examples of using slice_sample(), we sampled without replacement, meaning that a row cannot be selected again once it is already in the sample.

We can sample with replacement by passing the argument replace=TRUE to the slice_sample() function. In the example below, we use the top 6 rows of the storms data to illustrate sampling with replacement. With such a small dataset, we can easily spot the rows that were sampled more than once.

dplyr::slice_sample() – How to randomly sample rows with replacement in R?

set.seed(42)
storms %>% 
  head() %>%
  slice_sample(n=6, replace=TRUE)

We can see that the first, third and fourth rows are exactly the same, which is the result of sampling with replacement.

## # A tibble: 6 × 13
##   name   year month   day  hour   lat  long status       category  wind pressure
##   <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>        <ord>    <int>    <int>
## 1 Amy    1975     6    27     0  27.5 -79   tropical de… -1          25     1013
## 2 Amy    1975     6    28     0  31.5 -78.8 tropical de… -1          25     1012
## 3 Amy    1975     6    27     0  27.5 -79   tropical de… -1          25     1013
## 4 Amy    1975     6    27     0  27.5 -79   tropical de… -1          25     1013
## 5 Amy    1975     6    27     6  28.5 -79   tropical de… -1          25     1013
## 6 Amy    1975     6    27    18  30.5 -79   tropical de… -1          25     1013
## # … with 2 more variables: ts_diameter <dbl>, hu_diameter <dbl>
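
To make the repeated rows easier to spot, we can tag each original row with an id before sampling and then count how often each id was drawn. This is a sketch under the assumption that a helper column is acceptable; row_id and times_drawn are hypothetical names that are not part of the storms data.

```r
library(dplyr)

set.seed(42)
resampled <- storms %>%
  head() %>%
  # row_id is a hypothetical helper column recording the original row position
  mutate(row_id = row_number()) %>%
  slice_sample(n = 6, replace = TRUE)

# Count how many times each original row appears in the sample;
# any count above 1 is a row that was drawn more than once
draw_counts <- resampled %>%
  count(row_id, name = "times_drawn")

draw_counts
```

Rows that were never drawn do not appear in the counts at all, so the table usually has fewer than 6 rows.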

Weighted random sampling with slice_sample()

By default, the slice_sample() function gives equal weight to each row while sampling. This means each row has the same probability of being sampled.

We can also do weighted random sampling of rows based on the values of a column. This gives each row a different probability of being sampled, depending on its value in that column.

In the example below we randomly sample a dataframe weighted by the “wind” column. Because of this, rows with higher wind values are more likely to be sampled than the rows with smaller wind values.

storms %>% 
  slice_sample(n=6, weight_by = wind)
## # A tibble: 6 × 13
##   name     year month   day  hour   lat  long status     category  wind pressure
##   <chr>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>      <ord>    <int>    <int>
## 1 Bonnie   1998     8    29     6  39.2 -69.6 tropical … 0           45      999
## 2 Isaac    2000     9    28     6  23.8 -52   hurricane  3          105      955
## 3 Harvey   1999     9    20    18  27   -85.5 tropical … 0           50      998
## 4 Emily    2005     7    12    18  11   -52   tropical … 0           45     1004
## 5 Erin     1989     8    22    12  28.8 -44.5 hurricane  1           65      986
## 6 Allison  2001     6     5    21  28.9 -95.3 tropical … 0           45     1003
## # … with 2 more variables: ts_diameter <dbl>, hu_diameter <dbl>
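
With only 6 rows it is hard to see the effect of the weights. As a rough check, we can draw a larger weighted sample and compare its average wind speed with the average over the whole dataset. This is a sketch; the sample size of 1000 and the name weighted_sample are arbitrary choices for illustration.

```r
library(dplyr)

set.seed(42)
# Draw a larger sample so the effect of the weights becomes visible
weighted_sample <- storms %>%
  slice_sample(n = 1000, weight_by = wind)

# Average wind speed in the weighted sample versus the full dataset
mean(weighted_sample$wind)
mean(storms$wind)
```

Since rows with higher wind values are more likely to be picked, the average wind speed in the weighted sample should come out higher than the average over all rows.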