Rstats 101

Learn R Programming Tips & Tricks for Statistics and Data Science

T-test on real data using tidyverse

rstats101 · August 28, 2024 ·

In this post, we will learn how to perform a t-test on a real dataset and use the tidyverse framework to access the results of the t-test. Check out the post on how to do a t-test to learn the base R approach to the t-test.

We will use the Palmer penguins dataset, a real-world dataset, to show how to perform a t-test. First, we will run the t-test on a small subsample of the penguins data, and then run it again on the full dataset. Comparing the two shows the effect of sample size on the results of a t-test.

Let us get started loading the packages needed.

library(tidyverse)
library(palmerpenguins)
library(broom)
theme_set(theme_bw(16))

First, let us sub-sample the Palmer penguins dataset to get a dataset of size 30, with 15 female and 15 male penguins. We are interested in the relationship between sex and bill length in these 30 penguins. The question of interest is whether the mean bill length differs between males and females. We will use a t-test to check whether there is a statistically significant difference in mean bill length between the two groups.

set.seed(1234)
df <- penguins |>
  drop_na() |>
  select(sex, bill_length_mm) |>
  group_by(sex) |>
  slice_sample(n=15) |>
  ungroup()

df

# A tibble: 30 × 2
   sex    bill_length_mm
   <fct>           <dbl>
 1 female           35.7
 2 female           45.5
 3 female           47.6
 4 female           43.8
 5 female           45.2
 6 female           46.6
 7 female           45.4
 8 female           46.7
 9 female           46.5
10 female           46.6
# ℹ 20 more rows

Since we have the data needed in a dataframe, we will use t.test() by providing the formula as argument to do the t-test.
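Before moving to the tidyverse version, here is a minimal sketch of the formula interface in base R. It uses a simulated stand-in data frame (the name toy and its values are hypothetical, not the penguin measurements) so that it runs without any extra packages. With the df created above, the equivalent call would be t.test(bill_length_mm ~ sex, data = df).

```r
# Simulated stand-in data: hypothetical values, NOT the penguin measurements
set.seed(42)
toy <- data.frame(
  sex = factor(rep(c("female", "male"), each = 15)),
  bill_length_mm = c(rnorm(15, mean = 42, sd = 5),
                     rnorm(15, mean = 46, sd = 5))
)

# Formula interface: response ~ grouping variable with exactly two levels
welch <- t.test(bill_length_mm ~ sex, data = toy)

# The result is an "htest" list; individual pieces are accessed with $
welch$p.value    # p-value of the two-sided test
welch$estimate   # the two group means
welch$conf.int   # confidence interval for the difference in means
welch$method     # Welch's test is the default (var.equal = FALSE)
```

Note that t.test() does not assume equal variances by default, which is why the degrees of freedom it reports are usually not a whole number.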

T-test() with tidyverse

One way to use tidyverse framework to do t-test is as shown below.

df |>
  summarize(t_test_pval = t.test(bill_length_mm ~ sex)$p.value)

# A tibble: 1 × 1
  t_test_pval
        <dbl>
1       0.239

If you are interested in saving the full t.test() result object, here is a way to do it with tidyverse. We first perform the t-test and save the resulting object in a list column. Then we use broom's tidy() function, applied with map() from the purrr package, to get the result as a dataframe.

library(broom)
df |>
  summarize(t_test = list(t.test(bill_length_mm ~ sex))) |>
  mutate(ttest_res = map(t_test, tidy)) |>
  unnest(ttest_res)

# A tibble: 1 × 11
  t_test  estimate estimate1 estimate2 statistic p.value parameter conf.low
  <list>     <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>
1 <htest>    -2.01      44.6      46.7     -1.21   0.239      25.6    -5.45
# ℹ 3 more variables: conf.high <dbl>, method <chr>, alternative <chr>
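The printed tibble hides some columns because of its width. If only a few quantities are needed, one option is to tidy the test result directly and keep just those columns. The sketch below assumes the dplyr and broom packages are installed and again uses a simulated stand-in tibble (toy is a hypothetical name, and its values are not the penguin data).

```r
library(dplyr)
library(broom)

# Simulated stand-in data: hypothetical values, NOT the penguin measurements
set.seed(42)
toy <- tibble(
  sex = factor(rep(c("female", "male"), each = 15)),
  bill_length_mm = c(rnorm(15, mean = 42, sd = 5),
                     rnorm(15, mean = 46, sd = 5))
)

# tidy() turns the htest object into a one-row tibble;
# select() keeps only the columns we want to report
res <- t.test(bill_length_mm ~ sex, data = toy) |>
  tidy() |>
  select(estimate, statistic, p.value, conf.low, conf.high)

res
```

Here estimate is the difference between the two group means, and conf.low and conf.high bound the confidence interval for that difference.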

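We can see that the p-value from the t-test is 0.24, so at the usual 0.05 significance level we cannot reject the null hypothesis of equal means. In other words, with 15 penguins per group there is no statistically significant difference between the mean bill lengths of males and females. Let us check this visually by making a boxplot of bill length for males and females.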

df |>
  ggplot(aes(x=sex, y=bill_length_mm, fill=sex)) +
  geom_boxplot(outlier.shape = NA)+
  geom_jitter(width=0.1)+
  theme(legend.position = "none")

The boxplot shows that although the group means differ slightly, the large variation in bill length, mainly among male penguins, combined with the small sample size leads to the non-significant result we saw.

Applying t-test on real data with sample size of 15 in each group

T-test() using a larger sample with tidyverse

Instead of applying the t-test to 30 samples, let us perform the t-test on all the samples in the Palmer penguins data. We are interested in the same question: is there a significant difference in mean bill length between males and females? However, we now have a larger sample size, more than 300 penguins in total.

penguins |>
  drop_na() |>
  summarize(t_test = list(t.test(bill_length_mm ~ sex))) |>
  mutate(ttest_res = map(t_test, tidy)) |>
  unnest(ttest_res)

# A tibble: 1 × 11
  t_test  estimate estimate1 estimate2 statistic  p.value parameter conf.low
  <list>     <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>    <dbl>
1 <htest>    -3.76      42.1      45.9     -6.67 1.07e-10      329.    -4.87
# ℹ 3 more variables: conf.high <dbl>, method <chr>, alternative <chr>

Now the result from t.test() is statistically significant, showing that there is a clear difference in the mean bill lengths of the two groups.
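To see why sample size matters so much, it helps to compute the test statistic by hand. The sketch below uses simulated data with hypothetical means and standard deviations (not the penguin measurements) and a helper function, welch_t(), that is introduced here only for illustration. The group sizes appear in the denominator of the standard error, so for the same true difference a larger sample tends to give a smaller standard error and a larger statistic in absolute value.

```r
# Hypothetical helper: Welch t statistic for two numeric vectors
welch_t <- function(x, y) {
  # Standard error of the difference in means, without assuming equal variances
  se <- sqrt(var(x) / length(x) + var(y) / length(y))
  (mean(x) - mean(y)) / se
}

# Simulated groups with the same true difference in means (4) and spread (5),
# once with 15 observations per group and once with 165 per group
set.seed(42)
small_f <- rnorm(15, mean = 42, sd = 5)
small_m <- rnorm(15, mean = 46, sd = 5)
large_f <- rnorm(165, mean = 42, sd = 5)
large_m <- rnorm(165, mean = 46, sd = 5)

welch_t(small_f, small_m)   # statistic with 15 per group
welch_t(large_f, large_m)   # statistic with 165 per group

# The hand-computed value agrees with the statistic reported by t.test()
all.equal(welch_t(small_f, small_m),
          unname(t.test(small_f, small_m)$statistic))
```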

We can visualize the result as a boxplot and see the clear mean difference.

penguins |>
  drop_na() |>
  ggplot(aes(x=sex, y=bill_length_mm, fill=sex)) +
  geom_boxplot(outlier.shape = NA)+
  geom_jitter(width=0.1)+
  theme(legend.position = "none")+
  labs(title="t-test on real data with larger sample size in each group")

Applying t-test on real data with larger sample size in each group

Filed Under: rstats101, t.test(), tidyverse Tagged With: t-test with tidyverse
