In this post, we will learn how to perform a simple logistic regression using Generalized Linear Models (glm) in R. We will work with logistic regression model between a binary categorical variable as response variable and a single numerical predictor.
Data for logistic regression
Let us load the packages needed.
library(tidyverse) library(palmerpenguins) theme_set(theme_bw(16))
We will use two variables from Palmer penguins dataset. To simplify let us use dataset from one of the Penguin species and select just two variables, sex and body mass. Here, sex is a categorial variable and body mass is numerical variable.
df <- penguins %>% drop_na() %>% filter(species=="Gentoo") %>% select(sex, body_mass_g)
df %>% head() # A tibble: 6 × 2 sex body_mass_g <fct> <int> 1 female 4500 2 male 5700 3 female 4450 4 male 5700 5 male 5400 6 female 4550
Variables for logistic regression
In this logistical regression model, we are interested in classifying the sex of penguins using body mass. Let us make a simple boxplot to see how well sex variable is associated with body mass of penguins.
df %>% ggplot(aes(x=sex, y= body_mass_g, color=sex))+ geom_boxplot(outlier.shape = NA)+ geom_jitter(width=0.1)+ theme(legend.position = "none")+ scale_color_brewer(palette="Set1") ggsave("penguins_sex_vs_body_mass_boxplot.png")
Building Logistic Regression with glm()
To build logistic regression model we will use glm() function from stats package that has implementations of GLMs. glm() function takes three key arguments,
- formula, where we tell glm() the variables and how we want to model
- family name, here we specify the exponential family that we want to use, “biomial” for logistic regression
- data: a dataframe with all the variables specified in the formula
# single predictor glm_res_sp <- df %>% glm(formula = sex ~ body_mass_g, family = binomial)
By printing the resulting object, we can see the basic information regarding the model we built. We get the formula we used and the coeffients from the logistic regression model.
glm_res_sp Call: glm(formula = sex ~ body_mass_g, family = binomial, data = .) Coefficients: (Intercept) body_mass_g -55.03337 0.01089 Degrees of Freedom: 118 Total (i.e. Null); 117 Residual Null Deviance: 164.9 Residual Deviance: 45.1 AIC: 49.1
Results from Logistic Regression with single predictor uslng glm()
To fully investigate the results, we need to use summary() function on the result object. And this gives us the estimated coefficients and the p-value of the association between the two variables. We can see that
glm_res_sp %>% summary()
Call: glm(formula = sex ~ body_mass_g, family = binomial, data = .) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -55.033367 11.651375 -4.723 2.32e-06 *** body_mass_g 0.010888 0.002311 4.711 2.47e-06 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 164.893 on 118 degrees of freedom Residual deviance: 45.105 on 117 degrees of freedom AIC: 49.105 Number of Fisher Scoring iterations: 7
Cleaning Logistic Regression results with broom
The results from statistical models like glm() are often hard to read. So we will use broom package’s tidy() function to convert the resulting statistical object to a tidy dataframe.
glm_res_sp %>% tidy() # A tibble: 2 × 5 term estimate std.error statistic p.value <chr> <dbl> <dbl> <dbl> <dbl> 1 (Intercept) -55.0 11.7 -4.72 0.00000232 2 body_mass_g 0.0109 0.00231 4.71 0.00000247
We can also visualize the logistic regression model by plotting the two variables and a logistic curve. Note here we use ggplot2’s geom_smooth() function to make the plot.
df %>% mutate(sex=as.numeric(sex)-1) %>% ggplot(aes(x=body_mass_g, y=sex)) + geom_point() + geom_smooth(method="glm", se=TRUE, method.args = list(family=binomial)) ggsave("plotting_logistic_regression_model_with_glm.png")