T-test on real data using tidyverse

In this post, we will learn how to perform a t-test on a real dataset and use the tidyverse framework to access the results of the t-test. Check out the post on how to do a t-test to learn the base R approach to the t-test.

We will use the Palmer penguins dataset, a real-world dataset, to show how to perform a t-test. First, we will run the t-test on a smaller subset of the penguins dataset, and then run it again on the full dataset. This example shows the effect of sample size on the results of a t-test.

Let us get started by loading the packages needed.

library(tidyverse)
library(palmerpenguins)
library(broom)
theme_set(theme_bw(16))

First, let us sub-sample the Palmer penguins dataset to get a dataset of size 30. Here we are interested in understanding the relationship between sex and bill length in 30 penguins, of which 15 are males and the remaining 15 are females. The question of interest is whether the mean bill length differs between males and females. We will use a t-test to check whether there is a statistically significant difference in mean bill length between males and females.

set.seed(1234)
df <- penguins |>
  drop_na() |>
  select(sex, bill_length_mm) |>
  group_by(sex) |>
  slice_sample(n=15) |>
  ungroup()

df

# A tibble: 30 × 2
   sex    bill_length_mm
   <fct>           <dbl>
 1 female           35.7
 2 female           45.5
 3 female           47.6
 4 female           43.8
 5 female           45.2
 6 female           46.6
 7 female           45.4
 8 female           46.7
 9 female           46.5
10 female           46.6
# ℹ 20 more rows

Since we have the data needed in a dataframe, we will use t.test() and provide the formula bill_length_mm ~ sex as its argument to do the t-test.
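For reference, before moving to the tidyverse versions, the formula interface can also be called directly with a data argument. This is a minimal sketch that rebuilds the 30-penguin sample so that it runs on its own; the object name res is only an illustrative choice.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

# Rebuild the 15-per-sex sample used in this post
set.seed(1234)
df <- penguins |>
  drop_na() |>
  select(sex, bill_length_mm) |>
  group_by(sex) |>
  slice_sample(n = 15) |>
  ungroup()

# Formula interface: response ~ grouping variable, data frame passed via `data`
res <- t.test(bill_length_mm ~ sex, data = df)

res$p.value    # p-value of the two-sided test
res$estimate   # mean bill length in each group
res$method     # t.test() runs Welch's test unless var.equal = TRUE is set
```

Because t.test() does not assume equal variances by default, the degrees of freedom it reports are usually not a whole number.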

t.test() with tidyverse

One way to use tidyverse framework to do t-test is as shown below.

df |>
  summarize(t_test_pval = t.test(bill_length_mm ~ sex)$p.value)

# A tibble: 1 × 1
  t_test_pval
        <dbl>
1       0.239

If you are interested in saving the full t.test() result object, here is a way to do that with tidyverse. We first perform the t-test and save the resulting object in a list column. We then use broom’s tidy() function, with the help of the map() function from the purrr package, to get the result as a dataframe.

library(broom)
df |>
  summarize(t_test = list(t.test(bill_length_mm ~ sex))) |>
  mutate(ttest_res = map(t_test, tidy)) |>
  unnest(ttest_res)

# A tibble: 1 × 11
  t_test  estimate estimate1 estimate2 statistic p.value parameter conf.low
  <list>     <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>
1 <htest>    -2.01      44.6      46.7     -1.21   0.239      25.6    -5.45
# ℹ 3 more variables: conf.high <dbl>, method <chr>, alternative <chr>
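If the htest object itself does not need to be kept, a shorter variant is to pass the t.test() result straight to tidy() and keep only the columns of interest. The sketch below assumes the same 30-penguin sample as above and rebuilds it so the block is self-contained; ttest_tbl is an illustrative name.

```r
library(dplyr)
library(tidyr)
library(broom)
library(palmerpenguins)

# Rebuild the 15-per-sex sample used in this post
set.seed(1234)
df <- penguins |>
  drop_na() |>
  select(sex, bill_length_mm) |>
  group_by(sex) |>
  slice_sample(n = 15) |>
  ungroup()

# tidy() turns the htest object into a one-row tibble
ttest_tbl <- t.test(bill_length_mm ~ sex, data = df) |>
  tidy() |>
  select(estimate, statistic, p.value, conf.low, conf.high)

ttest_tbl
```

The list-column approach is still the better fit when the model object is needed later, for example to inspect it again without refitting.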

We can see that the p-value from the t-test is 0.24, i.e. there is no statistically significant difference in mean bill length between males and females. Let us check that visually by making a boxplot of bill length for males and females.

df |>
  ggplot(aes(x=sex, y=bill_length_mm, fill=sex)) +
  geom_boxplot(outlier.shape = NA)+
  geom_jitter(width=0.1)+
  theme(legend.position = "none")

The boxplot shows that even though the two groups differ slightly on average, the large variation in bill length, mainly among the male penguins, explains the non-significant result we saw.

Figure: Applying t-test on real data with a sample size of 15 in each group
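The spread seen in the boxplot can also be checked numerically. As a rough sketch, the code below computes the size, mean and standard deviation of bill length for each sex in the 30-penguin sample; group_summary, mean_bill and sd_bill are illustrative names, not part of the original analysis.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

# Rebuild the 15-per-sex sample used in this post
set.seed(1234)
df <- penguins |>
  drop_na() |>
  select(sex, bill_length_mm) |>
  group_by(sex) |>
  slice_sample(n = 15) |>
  ungroup()

# Group size, mean and standard deviation of bill length by sex
group_summary <- df |>
  group_by(sex) |>
  summarize(
    n = n(),
    mean_bill = mean(bill_length_mm),
    sd_bill = sd(bill_length_mm)
  )

group_summary
```

A standard deviation that is large relative to the gap between the two means is what makes a small sample unlikely to produce a significant result.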

t.test() on a larger sample with tidyverse

Instead of applying the t-test to 30 samples, let us perform the t-test on all the samples in the Palmer penguins data. We are interested in the same question: is there a significant difference in mean bill length between males and females? This time, however, we have a much larger sample, 333 penguins in total after dropping rows with missing values.

penguins |>
  drop_na() |>
  summarize(t_test = list(t.test(bill_length_mm ~ sex))) |>
  mutate(ttest_res = map(t_test, tidy)) |>
  unnest(ttest_res)

# A tibble: 1 × 11
  t_test  estimate estimate1 estimate2 statistic  p.value parameter conf.low
  <list>     <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>    <dbl>
1 <htest>    -3.76      42.1      45.9     -6.67 1.07e-10      329.    -4.87
# ℹ 3 more variables: conf.high <dbl>, method <chr>, alternative <chr>

Now the result from t.test() is statistically significant, showing that there is a clear difference in mean bill length between the two groups.
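To see why the larger sample changes the conclusion, it can help to compute the Welch t-statistic by hand: the difference in group means divided by a standard error that shrinks as the group sizes grow. This is a sketch for illustration only; the names stats, se, t_stat and t_ref are assumptions introduced here.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

full <- penguins |> drop_na()

# Per-group size, mean and variance of bill length
stats <- full |>
  group_by(sex) |>
  summarize(
    size = n(),
    m = mean(bill_length_mm),
    v = var(bill_length_mm)
  )

# Welch standard error: sqrt(s1^2 / n1 + s2^2 / n2)
se <- sqrt(sum(stats$v / stats$size))

# t statistic: (mean of first group - mean of second group) / standard error
t_stat <- (stats$m[1] - stats$m[2]) / se

# Compare with the statistic reported by t.test()
t_ref <- unname(t.test(bill_length_mm ~ sex, data = full)$statistic)
c(by_hand = t_stat, t_test = t_ref)
```

Since each group variance is divided by its group size, a similar difference in means yields a larger t-statistic in absolute value, and a smaller p-value, when there are more penguins per group.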

We can visualize the result as a boxplot and see the clear mean difference.

penguins |>
  drop_na() |>
  ggplot(aes(x=sex, y=bill_length_mm, fill=sex)) +
  geom_boxplot(outlier.shape = NA)+
  geom_jitter(width=0.1)+
  theme(legend.position = "none")+
  labs(title="t-test on real data with larger sample size in each group")