In this tutorial, we will learn how to extract p-value from a linear regression fit object in R. We will use two approaches to pull out p-value and other statistics of interest from a linear regression model. First we will extract the p-value directly from summary of the linear regression fit object using coefficients argument. And then we will use broom R package that can clean up the results from statistical model fit objects like simple linear regression.
Let us get started by loading the packages needed.
library(broom) library(tidyverse)
In this example, we will perform simple linear regression analysis using built-in iris dataset available as a dataframe.
iris %>% head() ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 1 5.1 3.5 1.4 0.2 setosa ## 2 4.9 3.0 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa ## 4 4.6 3.1 1.5 0.2 setosa ## 5 5.0 3.6 1.4 0.2 setosa ## 6 5.4 3.9 1.7 0.4 setosa
To perform simple linear regression with data available in a dataframe, we can specify data argument in lm() function. The advantage of using data argument is that we can specify the names of the columns in the dataframe as variables without using any quotes to fit linear regression model.
In the example below, we fit simple linear regression model between two numerical variables in the iris dataset.
# fit linear regression using lm() function with data argument lm_fit <- lm(Sepal.Length ~ Petal.Length, data=iris)
We can look at the summary of the linear regression fit using summary() function.summary() function gives us lots of information. We can see that the Sepal.Length is strongly associated with Petal.Length in the iris dataset. The p.value is very small and R-squared is about 0.75.
summary(lm_fit)
## ## Call: ## lm(formula = Sepal.Length ~ Petal.Length, data = iris) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.24675 -0.29657 -0.01515 0.27676 1.00269 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 4.30660 0.07839 54.94 <2e-16 *** ## Petal.Length 0.40892 0.01889 21.65 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.4071 on 148 degrees of freedom ## Multiple R-squared: 0.76, Adjusted R-squared: 0.7583 ## F-statistic: 468.6 on 1 and 148 DF, p-value: < 2.2e-16
Our main interest p.value is somewhere in the middle of the summary results with no obvious way to get the p.value alone.
Using str() function on the result of summary() on linear model fit, we can see that p.value is available as part of coefficients variable.
str(summary(lm_fit)) ## List of 11 ## $ call : language lm(formula = Sepal.Length ~ Petal.Length, data = iris) ## $ terms :Classes 'terms', 'formula' language Sepal.Length ~ Petal.Length ## .. ..- attr(*, "variables")= language list(Sepal.Length, Petal.Length) ## .. ..- attr(*, "factors")= int [1:2, 1] 0 1 ## .. .. ..- attr(*, "dimnames")=List of 2 ## .. .. .. ..$ : chr [1:2] "Sepal.Length" "Petal.Length" ## .. .. .. ..$ : chr "Petal.Length" ## .. ..- attr(*, "term.labels")= chr "Petal.Length" ## .. ..- attr(*, "order")= int 1 ## .. ..- attr(*, "intercept")= int 1 ## .. ..- attr(*, "response")= int 1 ## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> ## .. ..- attr(*, "predvars")= language list(Sepal.Length, Petal.Length) ## .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric" ## .. .. ..- attr(*, "names")= chr [1:2] "Sepal.Length" "Petal.Length" ## $ residuals : Named num [1:150] 0.2209 0.0209 -0.1382 -0.32 0.1209 ... ## ..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ... ## $ coefficients : num [1:2, 1:4] 4.3066 0.4089 0.0784 0.0189 54.9389 ... ## ..- attr(*, "dimnames")=List of 2 ## .. ..$ : chr [1:2] "(Intercept)" "Petal.Length" ## .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)" ## $ aliased : Named logi [1:2] FALSE FALSE ## ..- attr(*, "names")= chr [1:2] "(Intercept)" "Petal.Length" ## $ sigma : num 0.407 ## $ df : int [1:3] 2 148 2 ## $ r.squared : num 0.76 ## $ adj.r.squared: num 0.758 ## $ fstatistic : Named num [1:3] 469 1 148 ## ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf" ## $ cov.unscaled : num [1:2, 1:2] 0.03708 -0.00809 -0.00809 0.00215 ## ..- attr(*, "dimnames")=List of 2 ## .. ..$ : chr [1:2] "(Intercept)" "Petal.Length" ## .. ..$ : chr [1:2] "(Intercept)" "Petal.Length" ## - attr(*, "class")= chr "summary.lm"
coefficients argument gives us the effect size (the estimate), statistic, and p.value for each term as a dataframe.
summary(lm_fit)$coefficients ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 4.3066034 0.07838896 54.93890 2.426713e-100 ## Petal.Length 0.4089223 0.01889134 21.64602 1.038667e-47
Now we can pull pull p-value from the dataframe
First we pull all the p.values available in the column Pr(>|t|).
# pull p.values from linear regression model summary(lm_fit)$coefficients[ ,4]
Here we specifically pull p.value for the association between Sepal.Length and Petal.Length.
# pull p.value from linear regression model summary(lm_fit)$coefficients[2,4]
broom package. tidy the results. from linear regression model
Another option to get the p.value is to use broom R package.
The broom package takes the messy output of built-in functions in R, such as lm, nls, or t.test, and turns them into tidy tibbles.
tidy() function in broom package gives the result of coefficients in a tidy tibble format.
broom::tidy(lm_fit) ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 4.31 0.0784 54.9 2.43e-100 ## 2 Petal.Length 0.409 0.0189 21.6 1.04e- 47
Note the column names of the resulting tibble is much cleaner making it easier to get what we need.
To get all the p.values from the linear model we can use pull() function on the p.value column.
broom::tidy(lm_fit) %>% pull(p.value)
And to get the p.value. for the specific term, we can using pull() function in combination with filter() function as shown below.
broom::tidy(lm_fit) %>% filter(term=="Petal.Length") %>% pull(p.value)