Logistic Regression with Single Predictor in R

In this post, we will learn how to perform a simple logistic regression using Generalized Linear Models (glm) in R. We will work with logistic regression model between a binary categorical variable as response variable and a single numerical predictor.

Data for logistic regression

Let us load the packages needed.

library(tidyverse)
library(palmerpenguins)
theme_set(theme_bw(16))

We will use two variables from Palmer penguins dataset. To simplify let us use dataset from one of the Penguin species and select just two variables, sex and body mass. Here, sex is a categorial variable and body mass is numerical variable.

df <- penguins %>%
  drop_na() %>%
  filter(species=="Gentoo")   %>%
  select(sex, body_mass_g)

df %>% head()

# A tibble: 6 × 2
  sex    body_mass_g
  <fct>        <int>
1 female        4500
2 male          5700
3 female        4450
4 male          5700
5 male          5400
6 female        4550

Variables for logistic regression

In this logistical regression model, we are interested in classifying the sex of penguins using body mass. Let us make a simple boxplot to see how well sex variable is associated with body mass of penguins.

df   %>%
  ggplot(aes(x=sex, y= body_mass_g, color=sex))+
  geom_boxplot(outlier.shape = NA)+
  geom_jitter(width=0.1)+
  theme(legend.position = "none")+
  scale_color_brewer(palette="Set1")
ggsave("penguins_sex_vs_body_mass_boxplot.png")

Logistic regression to classify Penguin sex with one predictor

Building Logistic Regression with glm()

To build logistic regression model we will use glm() function from stats package that has implementations of GLMs. glm() function takes three key arguments,

formula, where we tell glm() the variables and how we want to model
family name, here we specify the exponential family that we want to use, “biomial” for logistic regression
data: a dataframe with all the variables specified in the formula

# single predictor
glm_res_sp <- df %>%
  glm(formula = sex ~ body_mass_g,
      family = binomial)

By printing the resulting object, we can see the basic information regarding the model we built. We get the formula we used and the coeffients from the logistic regression model.

glm_res_sp

Call:  glm(formula = sex ~ body_mass_g, family = binomial, data = .)

Coefficients:
(Intercept)  body_mass_g  
  -55.03337      0.01089  

Degrees of Freedom: 118 Total (i.e. Null);  117 Residual
Null Deviance:      164.9 
Residual Deviance: 45.1     AIC: 49.1

Results from Logistic Regression with single predictor uslng glm()

To fully investigate the results, we need to use summary() function on the result object. And this gives us the estimated coefficients and the p-value of the association between the two variables. We can see that

glm_res_sp %>% summary()

Call:
glm(formula = sex ~ body_mass_g, family = binomial, data = .)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -55.033367  11.651375  -4.723 2.32e-06 ***
body_mass_g   0.010888   0.002311   4.711 2.47e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 164.893  on 118  degrees of freedom
Residual deviance:  45.105  on 117  degrees of freedom
AIC: 49.105

Number of Fisher Scoring iterations: 7

Cleaning Logistic Regression results with broom

The results from statistical models like glm() are often hard to read. So we will use broom package’s tidy() function to convert the resulting statistical object to a tidy dataframe.

glm_res_sp %>%
   tidy()

# A tibble: 2 × 5
  term        estimate std.error statistic    p.value
  <chr>          <dbl>     <dbl>     <dbl>      <dbl>
1 (Intercept) -55.0     11.7         -4.72 0.00000232
2 body_mass_g   0.0109   0.00231      4.71 0.00000247

We can also visualize the logistic regression model by plotting the two variables and a logistic curve. Note here we use ggplot2’s geom_smooth() function to make the plot.

df %>%
  mutate(sex=as.numeric(sex)-1) %>%
  ggplot(aes(x=body_mass_g, y=sex)) + 
  geom_point() +
  geom_smooth(method="glm",
              se=TRUE,
              method.args = list(family=binomial))
ggsave("plotting_logistic_regression_model_with_glm.png")