Sometimes you fit many simple linear regression models and want to extract the p-value from each one. In this tutorial, we will learn two approaches for extracting p-values from multiple simple linear regression models built in R. We will first use a for loop to fit the models and extract a p-value from each, and then learn how to use the base R lapply() function together with lm() to extract p-values from multiple simple linear regression models.
Simulate data for fitting many linear models
To fit many linear models we need the data as a matrix or a data frame. Let us simulate some data as a matrix of random numbers.
set.seed(1)
# Simulate 10 features from 50 individuals
mat <- matrix(rnorm(500), nrow=10)
We have 10 rows (features) and 50 columns (samples).
dim(mat)
[1] 10 50
Our data matrix looks like this, and we are interested in fitting one linear model for each row.
mat[1:5,1:5]
           [,1]       [,2]        [,3]        [,4]       [,5]
[1,] -0.6264538  1.5117812  0.91897737  1.35867955 -0.1645236
[2,]  0.1836433  0.3898432  0.78213630 -0.10278773 -0.2533617
[3,] -0.8356286 -0.6212406  0.07456498  0.38767161  0.6969634
[4,]  1.5952808 -2.2146999 -1.98935170 -0.05380504  0.5566632
[5,]  0.3295078  1.1249309  0.61982575 -1.37705956 -0.6887557
The predictor shared by all of the regressions is kept in a separate data frame, here a tibble.
library(tidyverse) # provides tibble() and the %>% pipe
meta = tibble(condition=c(rep("A", 25), rep("B", 25)))
meta %>% head()
# A tibble: 6 × 1
  condition
  <chr>
1 A
2 A
3 A
4 A
5 A
6 A
Let us quickly look at the result of fitting a linear model to just one of the rows.
summary(lm(mat[2, ] ~ ., data=meta))

Call:
lm(formula = mat[2, ] ~ ., data = meta)

Residuals:
     Min       1Q   Median       3Q      Max
-3.03746 -0.66902  0.00681  0.76742  1.98526

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.1485     0.2073   0.717    0.477
conditionB    0.1509     0.2931   0.515    0.609

Residual standard error: 1.036 on 48 degrees of freedom
Multiple R-squared: 0.005488, Adjusted R-squared: -0.01523
F-statistic: 0.2649 on 1 and 48 DF, p-value: 0.6092
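Before looping over every row, it may help to see where the p-value lives in a single fit. The sketch below re-creates the simulated data so that it runs on its own. As an assumption made only to keep it free of package dependencies, it stores the condition labels in a base data.frame rather than a tibble. The names fit, coef_table, and p_condition are illustrative and do not come from the steps above.

```r
# Self-contained sketch: pull the conditionB p-value out of one fitted model.
set.seed(1)
mat <- matrix(rnorm(500), nrow = 10)

# Assumption: a base data.frame stands in for the tibble used in this post.
meta <- data.frame(condition = rep(c("A", "B"), each = 25))

fit <- lm(mat[2, ] ~ condition, data = meta)

# summary() stores the coefficient table as a matrix with one row per term
# and the columns Estimate, Std. Error, t value and Pr(>|t|).
coef_table <- summary(fit)$coefficients

# Indexing by name is equivalent to [2, 4] here, but states the intent.
p_condition <- coef_table["conditionB", "Pr(>|t|)"]
p_condition
```

Indexing by row and column name is a design choice: it keeps working if the order of the terms changes, whereas positional indexing silently picks up whatever happens to sit in row 2.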
Extracting p-values from multiple simple linear models using for-loop
Let us use a for loop to go through each row of the data matrix and fit a simple linear regression model. We extract the p-value from the coefficient table returned by summary(): the table is stored in the coefficients element, its second row holds the conditionB term, and its fourth column holds Pr(>|t|).
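pvals <- numeric(nrow(mat)) # preallocate one slot per row
for (i in seq_len(nrow(mat))) {
  # row 2 = conditionB term, column 4 = Pr(>|t|)
  pvals[i] <- summary(lm(mat[i, ] ~ ., data=meta))$coefficients[2,4]
}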
pvals
 [1] 0.85331563 0.60915956 0.49360471 0.32433361 0.06477015 0.85330936
 [7] 0.16260390 0.63295885 0.07527383 0.60309934
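As an optional sanity check on the loop, and not a step the tutorial itself depends on: with a single two-level predictor, the slope p-value from lm() matches the p-value of a two-sample t-test that assumes equal variances. The sketch below re-creates the data, uses a base data.frame in place of the tibble (an assumption made to avoid package dependencies), and compares the two sets of p-values. The names pvals_lm, pvals_t, and is_a are illustrative.

```r
# Self-contained sketch: cross-check the regression p-values against t.test().
set.seed(1)
mat <- matrix(rnorm(500), nrow = 10)

# Assumption: a base data.frame stands in for the tibble used in this post.
meta <- data.frame(condition = rep(c("A", "B"), each = 25))

# p-values from the regression, one per row, as in the for loop above
pvals_lm <- numeric(nrow(mat))
for (i in seq_len(nrow(mat))) {
  pvals_lm[i] <- summary(lm(mat[i, ] ~ condition, data = meta))$coefficients[2, 4]
}

# p-values from an equal-variance two-sample t-test on the same rows
is_a <- meta$condition == "A"
pvals_t <- vapply(
  seq_len(nrow(mat)),
  function(i) t.test(mat[i, is_a], mat[i, !is_a], var.equal = TRUE)$p.value,
  numeric(1)
)

# TRUE when the two vectors agree up to numerical tolerance
all.equal(pvals_lm, pvals_t)
```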
Extracting p-values from multiple simple linear models using lapply() function
Another approach to extracting p-values from many models takes advantage of the fact that lm() accepts a matrix as the response and fits a separate linear model to each of its columns in a single call.
For example, by passing the transpose of our data matrix to lm(), we fit one linear model per column, that is, one per row of the original matrix. Calling summary() on the resulting fit returns a list with one summary per response. Here is a look at the summaries of the first two fits.
summary(lm(t(mat) ~ ., data = meta))
Response Y1 :

Call:
lm(formula = Y1 ~ condition, data = meta)

Residuals:
    Min      1Q  Median      3Q     Max
-2.7113 -0.5965 -0.1143  0.8177  2.3394

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.06218    0.21619   0.288    0.775
conditionB   0.05683    0.30573   0.186    0.853

Residual standard error: 1.081 on 48 degrees of freedom
Multiple R-squared: 0.0007194, Adjusted R-squared: -0.0201
F-statistic: 0.03455 on 1 and 48 DF, p-value: 0.8533

Response Y2 :

Call:
lm(formula = Y2 ~ condition, data = meta)

Residuals:
     Min       1Q   Median       3Q      Max
-3.03746 -0.66902  0.00681  0.76742  1.98526

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.1485     0.2073   0.717    0.477
conditionB    0.1509     0.2931   0.515    0.609

Residual standard error: 1.036 on 48 degrees of freedom
Multiple R-squared: 0.005488, Adjusted R-squared: -0.01523
F-statistic: 0.2649 on 1 and 48 DF, p-value: 0.6092

....
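To make the shape of these objects concrete, the sketch below inspects what lm() and summary() return when the response is a matrix. It re-creates the data so it runs on its own and, as an assumption to avoid package dependencies, uses a base data.frame instead of the tibble. The names fit_all and summaries are illustrative.

```r
# Self-contained sketch: inspect a multi-response fit and its summary.
set.seed(1)
mat <- matrix(rnorm(500), nrow = 10)

# Assumption: a base data.frame stands in for the tibble used in this post.
meta <- data.frame(condition = rep(c("A", "B"), each = 25))

# A matrix response yields one fit per column of t(mat)
fit_all <- lm(t(mat) ~ condition, data = meta)
class(fit_all)         # includes "mlm" for a multi-response fit
dim(coef(fit_all))     # rows are terms, columns are responses

# summary() on the multi-response fit gives one summary per response
summaries <- summary(fit_all)
length(summaries)      # one element per response
names(summaries)[1:2]  # element names of the form "Response Y1"
class(summaries[[1]])  # each element is an ordinary summary.lm object
```

Because each element is a regular summary.lm object, anything that works on the summary of a single model, such as reading its coefficients table, also works on each element of this list.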
We can then use base R's lapply() function on this list of summaries to extract the p-values. lapply() applies a function to every element of a list or vector. In our example we have a list of summary objects, so we can write an anonymous function that extracts the p-value, as shown below. lapply() returns a list of the same length as the input list, where each element is the result of applying the function to the corresponding element.
lapply(summary(lm(t(mat) ~ ., data = meta)), function(x){x$coefficients[2,4]})
$`Response Y1`
[1] 0.8533156

$`Response Y2`
[1] 0.6091596

$`Response Y3`
[1] 0.4936047

$`Response Y4`
[1] 0.3243336

$`Response Y5`
[1] 0.06477015
We can then convert the resulting list of p-values to a numeric vector using the as.numeric() function.
lapply(summary(lm(t(mat) ~ ., data=meta)), function(x){x$coefficients[2,4]}) %>%
  as.numeric()
 [1] 0.85331563 0.60915956 0.49360471 0.32433361 0.06477015 0.85330936
 [7] 0.16260390 0.63295885 0.07527383 0.60309934
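A possible variation, shown here only as a sketch: base R's vapply() can replace the lapply() plus as.numeric() pair, since it returns an atomic vector directly and checks that every result has the declared type and length. The code re-creates the data so it runs on its own and uses a base data.frame in place of the tibble (an assumption to avoid package dependencies). The names summaries, pvals_vapply, and pvals_lapply are illustrative.

```r
# Self-contained sketch: extract all p-values with vapply().
set.seed(1)
mat <- matrix(rnorm(500), nrow = 10)

# Assumption: a base data.frame stands in for the tibble used in this post.
meta <- data.frame(condition = rep(c("A", "B"), each = 25))

summaries <- summary(lm(t(mat) ~ condition, data = meta))

# numeric(1) declares that each call must return exactly one number;
# vapply() raises an error otherwise instead of silently coercing.
pvals_vapply <- vapply(
  summaries,
  function(s) s$coefficients["conditionB", "Pr(>|t|)"],
  numeric(1)
)

# The lapply() + as.numeric() route from above, for comparison
pvals_lapply <- as.numeric(lapply(summaries, function(s) s$coefficients[2, 4]))

# vapply() keeps the list names, so drop them before comparing
all.equal(unname(pvals_vapply), pvals_lapply)
```

Unlike as.numeric(), vapply() keeps the "Response Y1", "Response Y2", ... names on the result, which can make it easier to trace each p-value back to its row.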
In summary, we have seen two approaches to extracting p-values from many linear models. The first uses a for loop, and the second uses the lapply() function on a single multi-response fit. In a future post, we will learn how to use a tidyverse approach to extract p-values from multiple simple linear models.