In this tutorial we will learn how to perform simple linear regression between two numerical variables in R using lm() function. The resulting object from a linear regression analysis using lm() is unwieldy and not intuitive for beginners. We use broom package that is part of tidymodels to make the messy output from lm() into human friendly tidy data frames.
First let us load the packages needed.
library(tidyverse) library(palmerpenguins) library(broom) theme_set(theme_bw(16))
Our Penguins data looks like this.
penguins %>% head() # A tibble: 6 × 8 species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g <fct> <fct> <dbl> <dbl> <int> <int> 1 Adelie Torgersen 39.1 18.7 181 3750 2 Adelie Torgersen 39.5 17.4 186 3800 3 Adelie Torgersen 40.3 18 195 3250 4 Adelie Torgersen NA NA NA NA 5 Adelie Torgersen 36.7 19.3 193 3450 6 Adelie Torgersen 39.3 20.6 190 3650 # ℹ 2 more variables: sex <fct>, year <int>
We will be using two numerical variables, bill length and flipper length of Penguins, and look at their association using the linear model. First, let use visualize the relationship between the two variables with a scatter plot made with ggplot2.
penguins %>% ggplot(aes(bill_length_mm, flipper_length_mm))+ geom_point()+ geom_smooth(method="lm") ggsave("regression_quanitative_variables.png")
Linear Regression with lm() in R
lm() function in R let us build linear regression model. We provide the formula describing our linear model as the first argument and then data as second argument. Note that in the simple linear regression we specify the model as bill_length_mm ~ flipper_length_mm.
lm_res = lm(bill_length_mm ~ flipper_length_mm, data=penguins)
We also save the result of our linear regression model. By simply using the resulting variable name we get the basic information about the model, the formula used in the model and the coefficients after fitting the data.
lm_res Call: lm(formula = bill_length_mm ~ flipper_length_mm, data = penguins) Coefficients: (Intercept) flipper_length_mm -7.2649 0.2548
Extracting the results linear regression result object
Often we need more information from the results of linear regression model. In R, we can use summary() function on the lm object to get more information about the model.
The results from summary() function contain summary of residuals of the model, coefficient estimates and the p-values. Extracting any information from the summary() of linear model is difficult.
summary(lm_res) Call: lm(formula = bill_length_mm ~ flipper_length_mm, data = penguins) Residuals: Min 1Q Median 3Q Max -8.5792 -2.6715 -0.5721 2.0148 19.1518 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -7.26487 3.20016 -2.27 0.0238 * flipper_length_mm 0.25477 0.01589 16.03 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 4.126 on 340 degrees of freedom (2 observations deleted due to missingness) Multiple R-squared: 0.4306, Adjusted R-squared: 0.4289 F-statistic: 257.1 on 1 and 340 DF, p-value: < 2.2e-16
broom’s glance() function gives a quick look at the results from the linear regression result object.
lm_res %>% broom::glance() # A tibble: 1 × 12 r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 0.431 0.429 4.13 257. 1.74e-43 1 -969. 1944. 1955. # 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
broom’s tidy() function to clean the lm results
With broom, it is much easier to get the results from a linear regression model. tidy() function in broom package will give the coefficient estimates and p-values in a nice tabular form as a tibble.
lm_res %>% broom::tidy() # A tibble: 2 × 5 term estimate std.error statistic p.value <chr> <dbl> <dbl> <dbl> <dbl> 1 (Intercept) -7.26 3.20 -2.27 2.38e- 2 2 flipper_length_mm 0.255 0.0159 16.0 1.74e-43
For example we can see that in the “term” column our flipper length variable and see that it has significant association with bill length.
With conf.int=TRUE option we can also get the confidence interval for our estimates.
lm_res %>% broom::tidy(conf.int=TRUE) # A tibble: 2 × 7 term estimate std.error statistic p.value conf.low conf.high <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 (Intercept) -7.26 3.20 -2.27 2.38e- 2 -13.6 -0.970 2 flipper_length_mm 0.255 0.0159 16.0 1.74e-43 0.224 0.286
With augment() function in broom we can add the original data to results from linear model.
lm_res %>% augment() # A tibble: 342 × 9 .rownames bill_length_mm flipper_length_mm .fitted .resid .hat .sigma <chr> <dbl> <int> <dbl> <dbl> <dbl> <dbl> 1 1 39.1 181 38.8 0.252 0.00881 4.13 2 2 39.5 186 40.1 -0.622 0.00622 4.13 3 3 40.3 195 42.4 -2.11 0.00344 4.13 4 5 36.7 193 41.9 -5.21 0.00385 4.12 5 6 39.3 190 41.1 -1.84 0.00469 4.13 6 7 38.9 181 38.8 0.0518 0.00881 4.13 7 8 39.2 195 42.4 -3.21 0.00344 4.13 8 9 34.1 193 41.9 -7.81 0.00385 4.11 9 10 42 190 41.1 0.859 0.00469 4.13 10 11 37.8 186 40.1 -2.32 0.00622 4.13 # 332 more rows # 2 more variables: .cooksd <dbl>, .std.resid <dbl>
[…] results from statistical models like glm() are often hard to read. So we will use broom package’s tidy() function to convert the resulting statistical object to a tidy […]