How to Extract p-values from multiple simple linear regression models

Sometimes you fit many simple linear regression models and want to extract the p-value from each one. In this tutorial, we will learn two approaches to extracting p-values from multiple simple linear regression models built in R. We will first use a for loop to fit the models and extract a p-value from each, and then we will learn how to use the lapply() function in base R together with lm() to extract p-values from multiple simple linear regression models.

Simulate data for fitting many linear models

To fit many linear models we need the data as a matrix or a data frame. Let us simulate some data as a matrix of random numbers.

set.seed(1)
# Simulate 10 features from 50 individuals
mat <- matrix(rnorm(500), nrow=10)

We have 10 rows (features) and 50 columns (samples).

dim(mat)
[1] 10 50

Our data matrix looks like this, and we are interested in fitting one linear model for each row.

mat[1:5,1:5]

           [,1]       [,2]        [,3]        [,4]       [,5]
[1,] -0.6264538  1.5117812  0.91897737  1.35867955 -0.1645236
[2,]  0.1836433  0.3898432  0.78213630 -0.10278773 -0.2533617
[3,] -0.8356286 -0.6212406  0.07456498  0.38767161  0.6969634
[4,]  1.5952808 -2.2146999 -1.98935170 -0.05380504  0.5566632
[5,]  0.3295078  1.1249309  0.61982575 -1.37705956 -0.6887557

The predictor variable shared by all the models is in a separate data frame, here a tibble.

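library(tidyverse)  # provides tibble() and the %>% pipe
# One condition label per sample (column of mat)
meta <- tibble(condition=c(rep("A", 25), 
                       rep("B", 25)))
meta %>% head()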

# A tibble: 6 × 1
  condition
  <chr>    
1 A        
2 A        
3 A        
4 A        
5 A        
6 A
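
Because the models pair each column of mat with one row of meta, it can be worth checking that the two line up before fitting anything. This is a minimal sketch, assuming the same simulated data as above. It uses a base R data.frame in place of the tibble so that it runs without extra packages; that substitution is ours, not part of the original workflow.

```r
set.seed(1)
mat <- matrix(rnorm(500), nrow = 10)

# Assumption: a plain data.frame stands in for the tibble used above
meta <- data.frame(condition = rep(c("A", "B"), each = 25))

# One row of meta per column (sample) of mat
stopifnot(nrow(meta) == ncol(mat))

# Number of samples in each condition
table(meta$condition)
```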

Let us quickly look at the result of fitting a linear model to just one of the rows.

summary(lm(mat[2, ] ~ ., data=meta))


Call:
lm(formula = mat[2, ] ~ ., data = meta)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.03746 -0.66902  0.00681  0.76742  1.98526 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.1485     0.2073   0.717    0.477
conditionB    0.1509     0.2931   0.515    0.609

Residual standard error: 1.036 on 48 degrees of freedom
Multiple R-squared:  0.005488,  Adjusted R-squared:  -0.01523 
F-statistic: 0.2649 on 1 and 48 DF,  p-value: 0.6092
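
The p-value for the condition effect can be pulled out of this summary directly. The sketch below assumes the same simulated data (with a base R data.frame standing in for the tibble, our substitution) and uses coef() on the summary object, which returns the coefficient table printed above. The names fit2, coef_table and p2 are only illustrative.

```r
set.seed(1)
mat <- matrix(rnorm(500), nrow = 10)
meta <- data.frame(condition = rep(c("A", "B"), each = 25))

fit2 <- lm(mat[2, ] ~ ., data = meta)

# Coefficient table: one row per term,
# columns Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(summary(fit2))

# Index by name rather than by position to make the intent explicit
p2 <- coef_table["conditionB", "Pr(>|t|)"]
p2
```

Indexing by position, coef_table[2, 4], selects the same cell.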

Extracting p-values from multiple simple linear models using for-loop

Let us use a for loop to go through each row of the data matrix and fit a simple linear regression model. We extract the p-value from the coefficient table returned by the summary() function: row 2, column 4 of that table is the Pr(>|t|) value for conditionB.

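pvals <- numeric(nrow(mat))  # preallocate one slot per row
for (i in seq_len(nrow(mat))) {
    # row 2, column 4 of the coefficient table: Pr(>|t|) for conditionB
    pvals[i] <- summary(lm(mat[i, ] ~ ., data=meta))$coefficients[2,4]
}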

pvals

 [1] 0.85331563 0.60915956 0.49360471 0.32433361 0.06477015 0.85330936
 [7] 0.16260390 0.63295885 0.07527383 0.60309934
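
The same loop can also be written without managing the index and result vector by hand. The following is a sketch of one alternative using base R's vapply(), not the method used in this post. get_pval is a hypothetical helper name, and a base R data.frame again stands in for the tibble.

```r
set.seed(1)
mat <- matrix(rnorm(500), nrow = 10)
meta <- data.frame(condition = rep(c("A", "B"), each = 25))

# Hypothetical helper: fit one model for row i and
# return the p-value for the conditionB coefficient
get_pval <- function(i) {
    fit <- lm(mat[i, ] ~ ., data = meta)
    summary(fit)$coefficients[2, 4]
}

# vapply() checks that every call returns exactly one numeric value
pvals <- vapply(seq_len(nrow(mat)), get_pval, numeric(1))
pvals
```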

Extracting p-values from multiple simple linear models using lapply() function

Another approach to extracting p-values from many models takes advantage of the fact that the lm() function accepts a matrix as the response and fits a linear model to every column of that matrix at the same time.

For example, by passing the transpose of our data matrix as the response to lm(), we fit one linear model per column of the transposed matrix, that is, one per row of the original matrix. The result is a single multi-response fit (class "mlm") rather than a list of separate fits. Applying summary() to it gives a list with one summary per response. Here is a look at the summaries of the first two fits.

summary(lm(t(mat) ~ ., data = meta))

Response Y1 :

Call:
lm(formula = Y1 ~ condition, data = meta)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.7113 -0.5965 -0.1143  0.8177  2.3394 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.06218    0.21619   0.288    0.775
conditionB   0.05683    0.30573   0.186    0.853

Residual standard error: 1.081 on 48 degrees of freedom
Multiple R-squared:  0.0007194, Adjusted R-squared:  -0.0201 
F-statistic: 0.03455 on 1 and 48 DF,  p-value: 0.8533


Response Y2 :

Call:
lm(formula = Y2 ~ condition, data = meta)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.03746 -0.66902  0.00681  0.76742  1.98526 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.1485     0.2073   0.717    0.477
conditionB    0.1509     0.2931   0.515    0.609

Residual standard error: 1.036 on 48 degrees of freedom
Multiple R-squared:  0.005488,  Adjusted R-squared:  -0.01523 
F-statistic: 0.2649 on 1 and 48 DF,  p-value: 0.6092

....
.....
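
To see what lm() returns in this case, we can inspect the objects directly. This is a small sketch under the same assumptions as before (simulated data, base R data.frame in place of the tibble). fit_all and sum_all are illustrative names.

```r
set.seed(1)
mat <- matrix(rnorm(500), nrow = 10)
meta <- data.frame(condition = rep(c("A", "B"), each = 25))

fit_all <- lm(t(mat) ~ ., data = meta)

# A multi-response fit: class "mlm" in addition to "lm"
class(fit_all)

# Coefficients form a matrix: one row per term, one column per response
dim(coef(fit_all))

# summary() returns a list with one summary per response
sum_all <- summary(fit_all)
length(sum_all)
names(sum_all)[1:2]
```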

We can then use base R’s lapply() function on this list of summaries to extract the p-value from each one. lapply() applies a function to every element of a list or vector and returns a list of the same length as the input, where each element is the result of applying the function to the corresponding element. In our example the input is the list of summary objects, and we can write an anonymous function that extracts the p-value, as shown below.

lapply(summary(lm(t(mat) ~ ., data = meta)),
       function(x){x$coefficients[2,4]})

$`Response Y1`
[1] 0.8533156

$`Response Y2`
[1] 0.6091596

$`Response Y3`
[1] 0.4936047

$`Response Y4`
[1] 0.3243336

$`Response Y5`
[1] 0.06477015

We can then convert the resulting list of p-values to a numeric vector using the as.numeric() function.

lapply(summary(lm(t(mat) ~ ., data=meta)),
       function(x){x$coefficients[2,4]}) %>% as.numeric()

 [1] 0.85331563 0.60915956 0.49360471 0.32433361 0.06477015 0.85330936
 [7] 0.16260390 0.63295885 0.07527383 0.60309934
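
As a quick check, the two approaches should give the same p-values. The sketch below recomputes both under the same assumptions (simulated data, base R data.frame in place of the tibble) and compares them with all.equal(). It also uses unlist(), an alternative to as.numeric() that keeps the response names. Variable names are illustrative.

```r
set.seed(1)
mat <- matrix(rnorm(500), nrow = 10)
meta <- data.frame(condition = rep(c("A", "B"), each = 25))

# Approach 1: one lm() call per row
pvals_loop <- numeric(nrow(mat))
for (i in seq_len(nrow(mat))) {
    pvals_loop[i] <- summary(lm(mat[i, ] ~ ., data = meta))$coefficients[2, 4]
}

# Approach 2: a single multi-response lm() call
pvals_list <- lapply(summary(lm(t(mat) ~ ., data = meta)),
                     function(x) x$coefficients[2, 4])

# unlist() flattens the list but keeps the "Response Y1", ... names
pvals_named <- unlist(pvals_list)

# TRUE when the two vectors agree up to numerical tolerance
all.equal(pvals_loop, as.numeric(pvals_list))
```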

In summary, we have seen two approaches to extracting p-values from many linear models. The first one uses a for loop and the second one uses the lapply() function. In a future post, we will learn how to use a tidyverse approach to extract p-values from multiple simple linear models.