Let’s pretend I take a random sample of 1200 American male adults and I ask them two questions:

- What’s your annual salary?
- Do you watch NFL football?

Further, let’s pretend they all answer, and they all answer honestly! Now I want to know if a person’s annual salary is associated with whether or not they watch NFL football. In other words, does the answer to the first question above help me predict the answer to the second question? In statistical language, I want to know if the continuous variable (salary) is associated with the binary variable (NFL? yes or no). Logistic regression helps us answer this question.

Before I go on I must say that whole books are devoted to logistic regression. I don’t intend to explain all of logistic regression in this one post. But I do hope to illustrate it with a simple example and hopefully provide some insight to what it does.

In classic linear regression, the response is continuous. For example, does a person’s weight help me predict diastolic blood pressure? The response is diastolic blood pressure and it’s a continuous variable. A good way to picture linear regression is to think of a scatter plot of two variables and imagine a straight line going through the dots that best represents the association between the two variables. Like this:

In my NFL/salary example, the response is “yes” or “no”. We can translate that to numbers as “yes”=1 and “no”=0.Â Now we have a binary response. It can only take two values: 1 or 0. If we do a scatter plot of the two variables salary and NFL, we get something like this:

Trying to draw a line through that to describe association between salary and 0/1 seems pretty dumb. Instead what we do is try to predict *the probability* of taking a 0 or 1. And that’s the basic idea of logistic regression: develop a model for predicting *the probability* of a binary response. So we do eventually have “a line” we draw through the plot above, but it’s a smooth curve in the shape of an “S” that depicts increasing (or decreasing) probability of taking a 0 or 1 value.

The logistic model is as follows:

where

So we use statistical software to find the and values (i.e., the coefficients), multiply it by our predictor (in our example, salary), and plug in to the inverse logit function to get a predicted probability of taking a 0 or 1. Let’s do that.

I generated some data in R as follows:

#simulate salaries between $23,000 and $100,000 salary <- round(runif(1200,23,100)) #simulate yes/no response to watching NFL nfl <- c() for (i in 1:1200){ Â Â Â if (salary[i] < 61) {nfl[i] <- rbinom(1,1,0.75)} Â Â Â else {nfl[i] <- rbinom(1,1,0.25)} }

The salaries are random in the range between 23 and 100. Obviously they represent thousands of dollars. The NFL response is not totally random. I rigged it so that a person making more money is less likely to watch the NFL. I have no idea if that's true or not, nor do I care. I just wanted to generate data with some association. Using R we can carry out the logistic regression as follows:

fit <- glm (nfl ~ salary, family=binomial(link="logit"))

We get the following results:

and .

Trust me when I say that those coefficients are statistically significant. Our model is .

Now we can use them to predict probabilities. What's the probability someone earning $40,000/year watches the NFL?

What's the probability someone earning $80,000/year watches the NFL?

I'm leaving out tons of math and theory here. As I said at the outset, there's much more to logistic regression. But what I just demonstrated is the basic idea. With our model we can now draw a line through our plot:

The line represents probability. It decreases as salary increases. People with higher salaries are less likely to say they watch the NFL. At least according to the world of my made up data.