How to get p-value from linear regression model

In this tutorial, we will learn how to extract p-value from a linear regression fit object in R. We will use two approaches to pull out p-value and other statistics of interest from a linear regression model. First we will extract the p-value directly from summary of the linear regression fit object using coefficients argument. And then we will use broom R package that can clean up the results from statistical model fit objects like simple linear regression.

Let us get started by loading the packages needed.

library(broom)
library(tidyverse)

In this example, we will perform simple linear regression analysis using built-in iris dataset available as a dataframe.

iris %>% head()
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

To perform simple linear regression with data available in a dataframe, we can specify data argument in lm() function. The advantage of using data argument is that we can specify the names of the columns in the dataframe as variables without using any quotes to fit linear regression model.

In the example below, we fit simple linear regression model between two numerical variables in the iris dataset.

# fit linear regression using lm() function with data argument
lm_fit <- lm(Sepal.Length ~ Petal.Length, data=iris)

We can look at the summary of the linear regression fit using summary() function.summary() function gives us lots of information. We can see that the Sepal.Length is strongly associated with Petal.Length in the iris dataset. The p.value is very small and R-squared is about 0.75.

summary(lm_fit)

## 
## Call:
## lm(formula = Sepal.Length ~ Petal.Length, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.24675 -0.29657 -0.01515  0.27676  1.00269 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.30660    0.07839   54.94   <2e-16 ***
## Petal.Length  0.40892    0.01889   21.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4071 on 148 degrees of freedom
## Multiple R-squared:   0.76,  Adjusted R-squared:  0.7583 
## F-statistic: 468.6 on 1 and 148 DF,  p-value: < 2.2e-16

Our main interest p.value is somewhere in the middle of the summary results with no obvious way to get the p.value alone.

Using str() function on the result of summary() on linear model fit, we can see that p.value is available as part of coefficients variable.

str(summary(lm_fit))
## List of 11
##  $ call         : language lm(formula = Sepal.Length ~ Petal.Length, data = iris)
##  $ terms        :Classes 'terms', 'formula'  language Sepal.Length ~ Petal.Length
##   .. ..- attr(*, "variables")= language list(Sepal.Length, Petal.Length)
##   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "Sepal.Length" "Petal.Length"
##   .. .. .. ..$ : chr "Petal.Length"
##   .. ..- attr(*, "term.labels")= chr "Petal.Length"
##   .. ..- attr(*, "order")= int 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(Sepal.Length, Petal.Length)
##   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:2] "Sepal.Length" "Petal.Length"
##  $ residuals    : Named num [1:150] 0.2209 0.0209 -0.1382 -0.32 0.1209 ...
##   ..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
##  $ coefficients : num [1:2, 1:4] 4.3066 0.4089 0.0784 0.0189 54.9389 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "(Intercept)" "Petal.Length"
##   .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
##  $ aliased      : Named logi [1:2] FALSE FALSE
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "Petal.Length"
##  $ sigma        : num 0.407
##  $ df           : int [1:3] 2 148 2
##  $ r.squared    : num 0.76
##  $ adj.r.squared: num 0.758
##  $ fstatistic   : Named num [1:3] 469 1 148
##   ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
##  $ cov.unscaled : num [1:2, 1:2] 0.03708 -0.00809 -0.00809 0.00215
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "(Intercept)" "Petal.Length"
##   .. ..$ : chr [1:2] "(Intercept)" "Petal.Length"
##  - attr(*, "class")= chr "summary.lm"

coefficients argument gives us the effect size (the estimate), statistic, and p.value for each term as a dataframe.

summary(lm_fit)$coefficients
##               Estimate Std. Error  t value      Pr(>|t|)
## (Intercept)  4.3066034 0.07838896 54.93890 2.426713e-100
## Petal.Length 0.4089223 0.01889134 21.64602  1.038667e-47

Now we can pull pull p-value from the dataframe

First we pull all the p.values available in the column Pr(>|t|).

# pull p.values from linear regression model
summary(lm_fit)$coefficients[ ,4]

Here we specifically pull p.value for the association between Sepal.Length and Petal.Length.

# pull p.value from linear regression model
summary(lm_fit)$coefficients[2,4]

broom package. tidy the results. from linear regression model

Another option to get the p.value is to use broom R package.

The broom package takes the messy output of built-in functions in R, such as lm, nls, or t.test, and turns them into tidy tibbles.

tidy() function in broom package gives the result of coefficients in a tidy tibble format.

broom::tidy(lm_fit)

## # A tibble: 2 × 5
##   term         estimate std.error statistic   p.value
##   <chr>           <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)     4.31     0.0784      54.9 2.43e-100
## 2 Petal.Length    0.409    0.0189      21.6 1.04e- 47

Note the column names of the resulting tibble is much cleaner making it easier to get what we need.

To get all the p.values from the linear model we can use pull() function on the p.value column.

broom::tidy(lm_fit) %>%
  pull(p.value)

And to get the p.value. for the specific term, we can using pull() function in combination with filter() function as shown below.

broom::tidy(lm_fit) %>%
  filter(term=="Petal.Length") %>%
  pull(p.value)