Linear Regression in R with lm() function – A Practical Tutorial

Scatterplot with linear fit line

In this tutorial, we will learn how to perform simple linear regression in R using the lm() function.

Simple linear regression is one of the most common statistical methods for understanding the relationship between two numerical (quantitative) variables, such as height and weight, age and height, or years of education and salary. One can think of simple linear regression as answering the question: are the two numerical variables of interest associated?

Statistically, performing simple linear regression amounts to the following: given a data set of the form (x1,y1), (x2,y2), (x3,y3),…, (xn,yn), we fit a linear model y = mx + c, where c is the intercept, the value of y where the line meets the y-axis, and m is the slope of the straight line.
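To make this concrete, the least-squares estimates have simple closed-form solutions: the slope is m = cov(x, y) / var(x) and the intercept is c = mean(y) - m * mean(x). Here is a minimal sketch in R that computes the two parameters by hand on a small made-up data set (x_toy and y_toy are just illustrative names):

# a small made-up data set for illustration
x_toy <- c(1, 2, 3, 4, 5)
y_toy <- c(2.1, 3.9, 6.2, 8.1, 9.8)

# closed-form least-squares estimates
m_hat <- cov(x_toy, y_toy) / var(x_toy)    # slope
c_hat <- mean(y_toy) - m_hat * mean(x_toy) # intercept

c(intercept = c_hat, slope = m_hat)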

We need some data to fit a linear regression model, so let us simulate both x and y as follows.

# make the simulation reproducible
set.seed(42)
# simulate y, then build x as y plus noise so the two variables are related
y <- rnorm(50, mean = 5, sd = 2)
x <- y + rnorm(50, mean = 1, sd = 1)
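Because x is constructed from y plus added noise, the two simulated variables should be strongly correlated by design. A quick optional check before fitting:

# sanity check: should be strongly positive for this simulated data
cor(x, y)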

When the data are available as vectors, as they are here, we can use the lm() function to run a simple linear regression by writing lm(y ~ x).

lm_fit_1 <- lm(y ~ x)

The object returned by the lm() function is our linear fit to the data. Printing the fit object shows the two parameters estimated from the data, the intercept and the slope. It also shows the model that was fit; in this case the model relates y to x, with the formula specified as lm(y ~ x).

lm_fit_1

## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##     -0.6944       0.9326
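The object returned by lm() is a list, so its pieces can also be pulled out programmatically. For example, the standard accessor functions fitted() and residuals() work on it directly:

class(lm_fit_1)            # "lm"
head(fitted(lm_fit_1))     # fitted values for the first few observations
head(residuals(lm_fit_1))  # the corresponding residuals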

Another useful function for understanding the result of fitting a linear model is summary(). When we call summary() on the fit object, it gives us detailed information about the results of the linear regression.

First, it tells us which model was used, lm(formula = y ~ x). The Coefficients table gives the intercept and slope estimates from the linear regression analysis, and it also reports the p-value from testing the association between the two numerical variables.

# summary of the linear fit
summary(lm_fit_1)

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.32189 -0.61615 -0.05203  0.71950  3.16215 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.69442    0.37363  -1.859   0.0692 .  
## x            0.93262    0.05808  16.059   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9217 on 48 degrees of freedom
## Multiple R-squared:  0.8431, Adjusted R-squared:  0.8398 
## F-statistic: 257.9 on 1 and 48 DF,  p-value: < 2.2e-16
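If you want these numbers in code rather than in printed output, the summary object stores them as named components (lm_summary below is just an illustrative name):

lm_summary <- summary(lm_fit_1)
lm_summary$r.squared      # multiple R-squared
lm_summary$adj.r.squared  # adjusted R-squared
lm_summary$coefficients   # estimates, standard errors, t values, p-values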

In the example data we used to fit the simple linear regression, the association between the two variables is very strong. We can tell that by looking at the adjusted R-squared value (about 0.84) and the p-value (below 2.2e-16).

We can also get the slope and intercept of the fitted model by using the coef() function on the fit object.

# calculate the slope and intercept of the line of best fit
coef(lm_fit_1)

## (Intercept)           x 
##  -0.6944232   0.9326167
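As a sanity check, these estimates agree with the closed-form least-squares solutions from earlier, applied to our simulated data:

# recompute the slope and intercept from the closed-form solutions
slope_hat <- cov(x, y) / var(x)
intercept_hat <- mean(y) - slope_hat * mean(x)
c(intercept_hat, slope_hat)  # matches coef(lm_fit_1)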

Let us visualize our data together with the results of the linear regression analysis. Using the base R plotting function plot(), we can make a scatter plot of the two variables and then add the regression line on top of it with abline(), passing the fit object as the argument.

plot(x, y)
abline(lm_fit_1)
Base R scatterplot with linear fit line

Another way to visualize the data and the linear regression results is to use ggplot2 from the tidyverse. Here, we make a scatter plot first using geom_point() and then add the regression line using the geom_smooth() function.

library(tidyverse)
# scatter plot with a linear regression line; se = FALSE hides the confidence band
tibble(x = x, y = y) |>
  ggplot(aes(x, y)) +
  geom_point() +
  theme_bw(16) +
  geom_smooth(method = "lm", se = FALSE)
Scatterplot with linear regression line