In this post, we will learn how to perform a t-test on a real dataset and use the tidyverse framework to access the results. Check out the post on how to do a t-test to learn the base R approach.
We will use the Palmer penguins dataset, a real-world dataset, to show how to perform a t-test. We will start with a small subsample of the penguin data, run the t-test on it, and then repeat the test on the full dataset. Comparing the two shows the effect of sample size on the results of a t-test.
Let us get started by loading the packages needed.
library(tidyverse)
library(palmerpenguins)
library(broom)
theme_set(theme_bw(16))
First, let us subsample the Palmer penguins dataset to get a dataset of 30 penguins: 15 males and 15 females. We are interested in the relationship between sex and bill length. The question is whether the mean bill length differs between males and females, and we will use a t-test to check whether the difference between the two group means is statistically significant.
set.seed(1234)
df <- penguins |>
  drop_na() |>
  select(sex, bill_length_mm) |>
  group_by(sex) |>
  slice_sample(n = 15) |>
  ungroup()
df

# A tibble: 30 × 2
   sex    bill_length_mm
   <fct>           <dbl>
 1 female           35.7
 2 female           45.5
 3 female           47.6
 4 female           43.8
 5 female           45.2
 6 female           46.6
 7 female           45.4
 8 female           46.7
 9 female           46.5
10 female           46.6
# ℹ 20 more rows
Now that the data we need is in a data frame, we can run the t-test by passing a formula to t.test().
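For reference, here is what the formula interface looks like when called directly, outside of any tidyverse verb. This is a minimal sketch on simulated data: the `toy` data frame and its means and standard deviation are hypothetical stand-ins for `df`, chosen only so the example runs with base R alone. Note that t.test() runs Welch's unequal-variance test by default (`var.equal = FALSE`).

```r
# Simulated stand-in for df: 15 "female" and 15 "male" bill lengths.
# The means (42, 46) and sd (5) are arbitrary assumptions, not penguin data.
set.seed(42)
toy <- data.frame(
  sex = factor(rep(c("female", "male"), each = 15)),
  bill_length_mm = c(rnorm(15, mean = 42, sd = 5),
                     rnorm(15, mean = 46, sd = 5))
)

# Formula interface: numeric response ~ two-level grouping variable
res <- t.test(bill_length_mm ~ sex, data = toy)

res$method     # name of the test that was run (Welch by default)
res$estimate   # the two group means
res$statistic  # the t statistic
res$p.value    # the p-value
res$conf.int   # 95% confidence interval for the difference in means
```

The returned object is a list of class "htest", which is why individual pieces such as the p-value can be pulled out with `$`.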
t.test() with tidyverse
One way to run a t-test within the tidyverse framework is to call t.test() inside summarize() and extract the p-value, as shown below.
df |>
  summarize(t_test_pval = t.test(bill_length_mm ~ sex)$p.value)

t_test_pval
      <dbl>
      0.239
If you want to keep the full t.test() result object, here is a way to do it with tidyverse. We first perform the t-test and save the resulting object in a list column. We then use broom's tidy() function, applied with map() from the purrr package, to turn the result into a data frame, and unnest() to expand it into regular columns.
library(broom)
df |>
  summarize(t_test_model = list(t.test(bill_length_mm ~ sex))) |>
  mutate(ttest_res = map(t_test_model, tidy)) |>
  unnest(ttest_res)

# A tibble: 1 × 11
  t_test_model estimate estimate1 estimate2 statistic p.value parameter conf.low
  <list>          <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>
1 <htest>         -2.01      44.6      46.7     -1.21   0.239      25.6    -5.45
# ℹ 3 more variables: conf.high <dbl>, method <chr>, alternative <chr>
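When only a single test is needed, the list column is optional, because tidy() can be applied directly to the object returned by t.test(). The sketch below shows this shortcut on simulated data; the `toy` data frame is a hypothetical stand-in for `df`, not the penguin sample.

```r
library(broom)

# Simulated stand-in for df (arbitrary means and sd, for illustration only)
set.seed(42)
toy <- data.frame(
  sex = factor(rep(c("female", "male"), each = 15)),
  bill_length_mm = c(rnorm(15, mean = 42, sd = 5),
                     rnorm(15, mean = 46, sd = 5))
)

# tidy() converts the htest object into a one-row tibble
res_tbl <- tidy(t.test(bill_length_mm ~ sex, data = toy))

names(res_tbl)   # estimate, statistic, p.value, conf.low, conf.high, ...
res_tbl$p.value  # same value as t.test(...)$p.value
```

The list-column pattern tends to pay off when several model objects need to be stored side by side; for one test, the direct call is shorter.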
We can see that the p-value from the t-test is 0.24, i.e. at the conventional 0.05 level there is no statistically significant difference between the mean bill lengths of males and females. Let us look at this visually by making a boxplot of bill length for males and females.
df |>
  ggplot(aes(x = sex, y = bill_length_mm, fill = sex)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.1) +
  theme(legend.position = "none")
The boxplot shows that even though the two groups differ slightly on average, there is a large variation in bill length, mainly among the male penguins. With only 15 penguins per group, that variation is enough to give the non-significant result we saw.
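A boxplot displays medians and quartiles rather than means, so it can help to put numbers next to the picture. Here is a small sketch that computes the group size, mean, and standard deviation per group. It runs on simulated data (the `toy` data frame is hypothetical); with the post's data, `df` would take its place.

```r
library(dplyr)

# Simulated stand-in for df (arbitrary means and sd, for illustration only)
set.seed(42)
toy <- data.frame(
  sex = factor(rep(c("female", "male"), each = 15)),
  bill_length_mm = c(rnorm(15, mean = 42, sd = 5),
                     rnorm(15, mean = 46, sd = 5))
)

# Per-group sample size, mean, and spread
group_stats <- toy |>
  group_by(sex) |>
  summarize(
    n = n(),
    mean_bill = mean(bill_length_mm),
    sd_bill = sd(bill_length_mm)
  )
group_stats
```

A standard deviation that is large relative to the difference in means, combined with a small `n`, is the situation in which a t-test tends to return a large p-value.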
t.test() on a larger sample with tidyverse
Instead of applying the t-test to 30 penguins, let us perform it on all the penguins in the Palmer penguins data. We are interested in the same question: is there a significant difference in mean bill length between males and females? This time, however, we have a much larger sample, 333 penguins in total after dropping rows with missing values.
penguins |>
  drop_na() |>
  summarize(t_test_model = list(t.test(bill_length_mm ~ sex))) |>
  mutate(ttest_res = map(t_test_model, tidy)) |>
  unnest(ttest_res)

# A tibble: 1 × 11
  t_test_model estimate estimate1 estimate2 statistic  p.value parameter conf.low
  <list>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>    <dbl>
1 <htest>         -3.76      42.1      45.9     -6.67 1.07e-10      329.    -4.87
# ℹ 3 more variables: conf.high <dbl>, method <chr>, alternative <chr>
Now the result from t.test() is statistically significant, showing that there is a clear difference in mean bill length between the two groups.
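To see why sample size matters so much, recall that the Welch t statistic is the difference in group means divided by its standard error, sqrt(s1^2/n1 + s2^2/n2), and that the standard error shrinks as the group sizes grow. The sketch below computes the statistic by hand and checks it against t.test(). The helper `welch_t()` and the simulated vectors are hypothetical, with arbitrary means, standard deviation, and group sizes; they are not the penguin measurements.

```r
# Hypothetical helper: Welch t statistic computed from first principles
welch_t <- function(x, y) {
  se <- sqrt(var(x) / length(x) + var(y) / length(y))  # standard error
  (mean(x) - mean(y)) / se
}

# Simulated groups with the same true means and sd, but different sizes
set.seed(42)
small_f <- rnorm(15,  mean = 42, sd = 5)
small_m <- rnorm(15,  mean = 46, sd = 5)
large_f <- rnorm(165, mean = 42, sd = 5)
large_m <- rnorm(165, mean = 46, sd = 5)

t_small <- welch_t(small_f, small_m)
t_large <- welch_t(large_f, large_m)
t_small
t_large

# The hand-computed value should agree with t.test()
all.equal(t_small, unname(t.test(small_f, small_m)$statistic))
```

With a similar underlying difference and spread, the larger groups typically give a statistic further from zero, because the denominator is smaller.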
We can visualize the result as a boxplot and see the clear mean difference.
penguins |>
  drop_na() |>
  ggplot(aes(x = sex, y = bill_length_mm, fill = sex)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.1) +
  theme(legend.position = "none") +
  labs(title = "t-test on real data with larger sample size in each group")