Tag Archives: logistic regression

Profile likelihood ratio confidence intervals

When you fit a generalized linear model (GLM) in R and call confint on the model object, you get confidence intervals for the model coefficients. But you also get an interesting message:

Waiting for profiling to be done...

What's that all about? What exactly is being profiled? Put simply, it's telling you that it's calculating a profile likelihood ratio confidence interval.

The typical way to calculate a 95% confidence interval is to multiply the standard error of an estimate by some normal quantile such as 1.96 and add/subtract that product to/from the estimate to get an interval. In the context of GLMs, we sometimes call that a Wald confidence interval.

Another way to determine an upper and lower bound of plausible values for a model coefficient is to find the minimum and maximum value of the set of all coefficients that satisfy the following:

\[-2\log\left(\frac{L(\beta_{0}, \beta_{1} \mid y_{1},\ldots,y_{n})}{L(\hat{\beta}_{0}, \hat{\beta}_{1} \mid y_{1},\ldots,y_{n})}\right) < \chi_{1,1-\alpha}^{2}\]

Inside the parentheses is a ratio of likelihoods. In the denominator is the likelihood of the model we fit. In the numerator is the likelihood of the same model but with different coefficients. (More on that in a moment.) We take the log of the ratio and multiply by -2. This gives us a likelihood ratio test (LRT) statistic. This statistic is typically used to test whether a coefficient is equal to some value, such as 0, with the null likelihood in the numerator (model without coefficient, that is, equal to 0) and the alternative or estimated likelihood in the denominator (model with coefficient). If the LRT statistic is less than \(\chi_{1,0.95}^{2} \approx 3.84\), we fail to reject the null. The coefficient is statistically not much different from 0. That means the likelihood ratio is close to 1. The likelihood of the model without the coefficient is almost as high as that of the model with it. On the other hand, if the ratio is small, that means the likelihood of the model without the coefficient is much smaller than the likelihood of the model with the coefficient. The log of a small ratio is a large negative number, so multiplying by -2 gives a large LRT statistic. A value larger than 3.84 leads to rejection of the null.

Now in the formula above, we are seeking all such coefficients in the numerator that would make it a true statement. You might say we're “profiling” many different null values and their respective LRT statistics. Do they fit the profile of a plausible coefficient value in our model? The smallest value we can get without violating the condition becomes our lower bound, and likewise with the largest value. When we're done we'll have a range of plausible values for our model coefficient that gives us some indication of the uncertainty of our estimate.

Let's load some data and fit a binomial GLM to illustrate these concepts. The following R code comes from the help page for confint.glm. This is an example from the classic Modern Applied Statistics with S. ldose is a dosing level and sex is self-explanatory. SF is number of successes and failures, where success is number of dead worms. We're interested in learning about the effects of dosing level and sex on number of worms killed. Presumably this worm is a pest of some sort.

# example from Venables and Ripley (2002, pp. 190-2.)
ldose <- rep(0:5, 2)
numdead <- c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16)
sex <- factor(rep(c("M", "F"), c(6, 6)))
SF <- cbind(numdead, numalive = 20-numdead)
budworm.lg <- glm(SF ~ sex + ldose, family = binomial)
summary(budworm.lg)
## Call:
## glm(formula = SF ~ sex + ldose, family = binomial)
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.10540  -0.65343  -0.02225   0.48471   1.42944  
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.4732     0.4685  -7.413 1.23e-13 ***
## sexM          1.1007     0.3558   3.093  0.00198 ** 
## ldose         1.0642     0.1311   8.119 4.70e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Dispersion parameter for binomial family taken to be 1)
##     Null deviance: 124.8756  on 11  degrees of freedom
## Residual deviance:   6.7571  on  9  degrees of freedom
## AIC: 42.867
## Number of Fisher Scoring iterations: 4

The coefficient for ldose looks significant. Let's determine a confidence interval for the coefficient using the confint function. We call confint on our model object, budworm.lg, and use the parm argument to specify that we only want to do it for ldose:

confint(budworm.lg, parm = "ldose")
## Waiting for profiling to be done...
##     2.5 %    97.5 % 
## 0.8228708 1.3390581

We get our “waiting” message though there really was no wait. If we fit a larger model and request multiple confidence intervals, then there might actually be a waiting period of a few seconds. The lower bound is about 0.82 and the upper bound about 1.34. We might say each one-unit increase in dosing level increases the log odds of killing worms by at least 0.8. We could also exponentiate to get a CI for an odds ratio estimate:

exp(confint(budworm.lg, parm = "ldose"))
## Waiting for profiling to be done...
##    2.5 %   97.5 % 
## 2.277027 3.815448

The odds of “success” (killing worms) are at least 2.3 times as high at one dosing level as at the next lower dosing level.
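For comparison, here is a sketch of the Wald interval described at the start of this post, computed for the same coefficient. It repeats the budworm data so the block runs on its own, and it uses confint.default, which builds the interval from the estimate and its standard error. The object names wald and wald_manual are just for illustration.

```r
# budworm data, repeated from above so this block is self-contained
ldose <- rep(0:5, 2)
numdead <- c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16)
sex <- factor(rep(c("M", "F"), c(6, 6)))
SF <- cbind(numdead, numalive = 20 - numdead)
budworm.lg <- glm(SF ~ sex + ldose, family = binomial)

# Wald interval by hand: estimate +/- normal quantile * standard error
est <- coef(summary(budworm.lg))["ldose", "Estimate"]
se  <- coef(summary(budworm.lg))["ldose", "Std. Error"]
wald_manual <- est + c(-1, 1) * qnorm(0.975) * se

# the same interval from confint.default
wald <- confint.default(budworm.lg, parm = "ldose")
wald_manual
wald
```

The Wald interval runs from roughly 0.81 to 1.32 and is symmetric about the estimate by construction. The profile likelihood ratio interval does not have to be symmetric.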

To better understand the profile likelihood ratio confidence interval, let's do it “manually”. Recall the denominator in the formula above was the likelihood of our fitted model. We can extract its log with the logLik function:

(den <- logLik(budworm.lg))
## 'log Lik.' -18.43373 (df=3)

The numerator was the likelihood of a model with a different coefficient. Here's the log likelihood of a model with a coefficient of 1.05:

(num <- logLik(glm(SF ~ sex + offset(1.05*ldose), family = binomial)))
## 'log Lik.' -18.43965 (df=2)

Notice we used the offset function. That allows us to fix the coefficient to 1.05 and not have it estimated.
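To see what offset is doing, here is a small sketch that refits the model with the ldose coefficient fixed at 1.05 and looks at which coefficients were actually estimated. The data are repeated so the block runs on its own, and the name fixed is just for illustration.

```r
# budworm data, repeated from above so this block is self-contained
ldose <- rep(0:5, 2)
numdead <- c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16)
sex <- factor(rep(c("M", "F"), c(6, 6)))
SF <- cbind(numdead, numalive = 20 - numdead)

# ldose enters the linear predictor with its coefficient held at 1.05
fixed <- glm(SF ~ sex + offset(1.05 * ldose), family = binomial)

# only the intercept and sexM are estimated; there is no ldose entry
coef(fixed)
```

That is also why the log likelihood above reports df=2 instead of df=3.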

Since we already extracted the log likelihoods, we need to subtract them. Remember this rule from algebra?

\[\log\frac{M}{N} = \log M - \log N\]

So we subtract the log of the denominator from the log of the numerator, multiply by -2, and check if it's less than 3.84, which we calculate with qchisq(p = 0.95, df = 1):

-2*(num - den)
## 'log Lik.' 0.01184421 (df=2)
-2*(num - den) < qchisq(p = 0.95, df = 1)
## [1] TRUE

It is. 1.05 seems like a plausible value for the ldose coefficient. That makes sense since the estimated value was 1.0642. Let's try it with a larger value, like 1.5:

num <- logLik(glm(SF ~ sex + offset(1.5*ldose), family = binomial))
-2*(num - den) < qchisq(p = 0.95, df = 1)
## [1] FALSE

FALSE. 1.5 seems too big to be a plausible value for the ldose coefficient.
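The two checks we just did are likelihood ratio tests in disguise, so we can also attach p-values to them. This is a sketch under the same setup: lrt_stat is a helper I made up for this illustration, and the p-value comes from the upper tail of a chi-square distribution with 1 degree of freedom.

```r
# budworm data and fitted model, repeated so this block is self-contained
ldose <- rep(0:5, 2)
numdead <- c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16)
sex <- factor(rep(c("M", "F"), c(6, 6)))
SF <- cbind(numdead, numalive = 20 - numdead)
budworm.lg <- glm(SF ~ sex + ldose, family = binomial)
den <- logLik(budworm.lg)

# hypothetical helper: LRT statistic for the null "ldose coefficient = b0"
lrt_stat <- function(b0) {
  num <- logLik(glm(SF ~ sex + offset(b0 * ldose), family = binomial))
  as.numeric(-2 * (num - den))
}

# p-values for the two candidate values tried above
p_105 <- pchisq(lrt_stat(1.05), df = 1, lower.tail = FALSE)
p_150 <- pchisq(lrt_stat(1.5), df = 1, lower.tail = FALSE)
c(p_105 = p_105, p_150 = p_150)
```

A candidate value belongs in the 95% interval exactly when its p-value is above 0.05, which is the same thing as its LRT statistic being below 3.84.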

Now that we have the general idea, we can program a while loop to check different values until we exceed our threshold of 3.84.

cf <- budworm.lg$coefficients[3]  # fitted coefficient 1.0642
cut <- qchisq(p = 0.95, df = 1)   # about 3.84
e <- 0.001                        # increment to add to coefficient
LR <- 0                           # to kick start our while loop 
while(LR < cut){
  cf <- cf + e
  num <- logLik(glm(SF ~ sex + offset(cf*ldose), family = binomial))
  LR <- -2*(num - den)
}
(upper <- cf)
##    ldose 
## 1.339214

To begin we save the original coefficient to cf, store the cutoff value to cut, define our increment of 0.001 as e, and set LR to an initial value of 0. In the loop we increment our coefficient estimate which is used in the offset function in the estimation step. There we extract the log likelihood and then calculate LR. If LR is less than cut (3.84), the loop starts again with a new coefficient that is 0.001 higher. We see that our upper bound of 1.339214 is very close to what we got above using confint (1.3390581). If we set e to smaller values we'll get closer.

We can find the LR profile lower bound in a similar way. Instead of adding the increment we subtract it:

cf <- budworm.lg$coefficients[3]  # reset cf
LR <- 0                           # reset LR 
while(LR < cut){
  cf <- cf - e
  num <- logLik(glm(SF ~ sex + offset(cf*ldose), family = binomial))
  LR <- -2*(num - den)
}
(lower <- cf)
##    ldose 
## 0.822214

The result, 0.822214, is very close to the lower bound we got from confint (0.8228708).
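Instead of stepping in small increments, we could let a root finder locate the points where the LRT statistic equals the cutoff. The following is a sketch using uniroot from base R. The helper lr_minus_cut and the search intervals (one unit either side of the estimate) are my own choices for illustration. This is not how confint does it internally.

```r
# budworm data and fitted model, repeated so this block is self-contained
ldose <- rep(0:5, 2)
numdead <- c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16)
sex <- factor(rep(c("M", "F"), c(6, 6)))
SF <- cbind(numdead, numalive = 20 - numdead)
budworm.lg <- glm(SF ~ sex + ldose, family = binomial)
den <- logLik(budworm.lg)
cut <- qchisq(p = 0.95, df = 1)   # about 3.84

# hypothetical helper: LRT statistic minus the cutoff; zero at the bounds
lr_minus_cut <- function(b) {
  num <- logLik(glm(SF ~ sex + offset(b * ldose), family = binomial))
  as.numeric(-2 * (num - den)) - cut
}

est <- coef(budworm.lg)[["ldose"]]

# the function is negative at the estimate and positive far from it,
# so each interval brackets exactly one crossing
upper <- uniroot(lr_minus_cut, interval = c(est, est + 1), tol = 1e-8)$root
lower <- uniroot(lr_minus_cut, interval = c(est - 1, est), tol = 1e-8)$root
c(lower = lower, upper = upper)
```

Both bounds should land very close to the confint results, without having to pick an increment.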

This is a very basic implementation of calculating a likelihood ratio confidence interval. It is only meant to give a general sense of what's happening when you see that message Waiting for profiling to be done.... I hope you found it helpful. To see how R does it, enter getAnywhere(profile.glm) in the console and inspect the code. It's not for the faint of heart.

I have to mention the book Analysis of Categorical Data with R, from which I gained a better understanding of the material in this post. The authors have kindly shared their R code at the following web site if you want to have a look: http://www.chrisbilder.com/categorical/

To see how they “manually” calculate likelihood ratio confidence intervals, go to the following R script and see the section “Examples of how to find profile likelihood ratio intervals without confint()”: http://www.chrisbilder.com/categorical/Chapter2/Placekick.R

A Logistic Regression Checklist

I recently read The Checklist Manifesto by Atul Gawande and was fascinated by how relatively simple checklists can improve performance and results in such complex endeavors as surgery or flying a commercial airplane. I decided I wanted to make a checklist of my own for logistic regression. It ended up not being a checklist on how to do it per se, but rather a list of important facts to remember. Here’s what I came up with.

  • Logistic regression models the probability that y = 1, P(y_{i} = 1) = logit^{-1}(X_{i}\beta) where logit^{-1}(x) = \frac{e^{x}}{1+e^{x}}
  • Logistic regression predictions are probabilistic. The model predicts a probability that y = 1. It does not make a point prediction.
  • The function logit^{-1}(x) = \frac{e^{x}}{1+e^{x}} transforms continuous values to the range (0,1).
  • Dividing a regression coefficient by 4 will give an upper bound of the predictive difference corresponding to a unit difference in x. For example if \beta = 0.33, then 0.33/4 = 0.08. This means a unit increase in x corresponds to no more than an 8 percentage point positive difference in the probability that y = 1.
  • The odds of success (i.e., y = 1) increase multiplicatively by e^{\beta} for every one-unit increase in x. That is, exponentiated logistic regression coefficients can be interpreted as odds ratios. For example, let’s say we have a regression coefficient of 0.497. Exponentiating gives e^{0.497} = 1.64 . That means the odds of success increase by 64% for each one-unit increase in x. Recall that odds = \frac{p}{1-p} . If our predicted probability at x is 0.674, then the odds of success are \frac{0.674}{0.326} = 2.07 . Therefore at x + 1, the odds will increase by 64% from 2.07 to 2.07(1.64) = 3.40. Notice that 1.64 = \frac{3.40}{2.07}, which is an odds ratio. The ratio of odds of x + 1 to x will always be e^{\beta}, where \beta is a logistic regression coefficient.
  • Plots of raw residuals from logistic regression are generally not useful. Instead it’s preferable to plot binned residuals “by dividing the data into categories (bins) based on their fitted values, and then plotting the average residual versus the average fitted value for each bin.” (Gelman & Hill, p. 97). Example R code for doing this can be found here.
  • The error rate is the proportion of cases for which your model predicts the wrong outcome: y = 1 when the case is actually y = 0 in the data, or y = 0 when the case is actually y = 1. We predict y = 1 when the predicted probability exceeds 0.5. Otherwise we predict y = 0. It’s not good if your error rate equals the null rate. The null rate is the error rate you get by guessing the same outcome for every case. In other words, if you guessed all cases in your data are y = 1, then the null rate is the percentage you guessed wrong, which is the proportion of 0’s in your data. Let’s say your data has 58% of y = 1 and 42% of y = 0, then the null rate is 42%. Further, let’s say you do some logistic regression on this data and your model has an error rate of 36%. That is, 36% of the time it predicts the wrong outcome. This means your model does only 6 percentage points better than simply guessing that all cases are y = 1.
  • Deviance is a measure of error. Lower deviance is better. When an informative predictor is added to a model, we expect deviance to decrease by more than 1. If not, then we’ve likely added a non-informative predictor to the model that just adds noise.
  • If a predictor x is completely aligned with the outcome so that y = 1 when x is above some threshold and y = 0 when x is below some threshold, then the coefficient estimate will explode to some gigantic value. This means the parameter cannot be estimated. This is an identifiability problem called separation.
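Here is a quick numerical check of the divide-by-4 rule and the odds ratio arithmetic from the list above. It's only a sketch: the numbers are the ones used in the list, and the variable names are my own.

```r
# inverse logit; base R's plogis computes the same thing
invlogit <- function(x) exp(x) / (1 + exp(x))

# divide-by-4 rule: the largest change in probability over a unit step in x
# happens when the step is centered where the probability is 0.5
beta <- 0.33
max_diff <- invlogit(beta / 2) - invlogit(-beta / 2)
c(max_diff = max_diff, bound = beta / 4)

# odds ratio: moving from x to x + 1 multiplies the odds by exp(b)
b <- 0.497
p_x <- 0.674                      # predicted probability at x
odds_x <- p_x / (1 - p_x)         # odds at x
odds_x1 <- odds_x * exp(b)        # odds at x + 1
p_x1 <- odds_x1 / (1 + odds_x1)   # back to a probability at x + 1
c(odds_x = odds_x, odds_x1 = odds_x1, ratio = odds_x1 / odds_x, p_x1 = p_x1)
```

The largest difference comes out just under the bound of \beta/4, and the ratio of the two odds is e^{\beta} no matter which starting probability we pick.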

Most of this comes from Chapter 5 of Data Analysis Using Regression and Multilevel/Hierarchical Models by Gelman and Hill. I also pulled a little from chapter 5 of An Introduction to Categorical Data Analysis by Agresti.

Logistic Regression

Let’s pretend I take a random sample of 1200 American male adults and I ask them two questions:

  1. What’s your annual salary?
  2. Do you watch NFL football?

Further, let’s pretend they all answer, and they all answer honestly! Now I want to know if a person’s annual salary is associated with whether or not they watch NFL football. In other words, does the answer to the first question above help me predict the answer to the second question? In statistical language, I want to know if the continuous variable (salary) is associated with the binary variable (NFL? yes or no). Logistic regression helps us answer this question.

Before I go on I must say that whole books are devoted to logistic regression. I don’t intend to explain all of logistic regression in this one post. But I do hope to illustrate it with a simple example and hopefully provide some insight to what it does.

In classic linear regression, the response is continuous. For example, does a person’s weight help me predict diastolic blood pressure? The response is diastolic blood pressure and it’s a continuous variable. A good way to picture linear regression is to think of a scatter plot of two variables and imagine a straight line going through the dots that best represents the association between the two variables. Like this:

In my NFL/salary example, the response is “yes” or “no”. We can translate that to numbers as “yes”=1 and “no”=0.  Now we have a binary response. It can only take two values: 1 or 0. If we do a scatter plot of the two variables salary and NFL, we get something like this:

Trying to draw a line through that to describe association between salary and 0/1 seems pretty dumb. Instead what we do is try to predict the probability of taking a 0 or 1. And that’s the basic idea of logistic regression: develop a model for predicting the probability of a binary response. So we do eventually have “a line” we draw through the plot above, but it’s a smooth curve in the shape of an “S” that depicts increasing (or decreasing) probability of taking a 0 or 1 value.

The logistic model is as follows:

Pr(y=1) = logit^{-1}(\alpha + x \beta)

where logit^{-1}(x) = \frac{e^{x}}{1+e^{x}}

So we use statistical software to find the \alpha and \beta values (i.e., the coefficients), multiply \beta by our predictor (in our example, salary), add \alpha, and plug the result into the inverse logit function to get a predicted probability of taking a 1. Let’s do that.

I generated some data in R as follows:

#simulate salaries between $23,000 and $100,000 (in thousands of dollars)
salary <- round(runif(1200,23,100))

#simulate yes/no response to watching NFL
nfl <- c()
for (i in 1:1200){
    if (salary[i] < 61) {nfl[i] <- rbinom(1,1,0.75)}
    else {nfl[i] <- rbinom(1,1,0.25)}
}

The salaries are random in the range between 23 and 100. Obviously they represent thousands of dollars. The NFL response is not totally random. I rigged it so that a person making more money is less likely to watch the NFL. I have no idea if that's true or not, nor do I care. I just wanted to generate data with some association. Using R we can carry out the logistic regression as follows:

fit <- glm (nfl ~ salary, family=binomial(link="logit"))

We get the following results:

\alpha = 2.88 and \beta= -0.05.

Trust me when I say that those coefficients are statistically significant. Our model is Pr(NFL=1) = logit^{-1}(2.88 - 0.05 \times salary).

Now we can use them to predict probabilities. What's the probability someone earning $40,000/year watches the NFL?

\frac{e^{2.88 -0.05(40)}}{1 + e^{2.88 -0.05(40)}} = 0.71

What's the probability someone earning $80,000/year watches the NFL?

\frac{e^{2.88 -0.05(80)}}{1 + e^{2.88 -0.05(80)}} = 0.25
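In practice we'd let R do this arithmetic. Here is a sketch using predict with type = "response". The seed is my own assumption (the original simulation didn't set one), so the fitted coefficients and probabilities will differ a little from the numbers in this post. The simulation is also written in vectorized form instead of a loop.

```r
set.seed(1)  # assumed seed, not part of the original simulation

# simulate salaries (in thousands) and the yes/no NFL response
salary <- round(runif(1200, 23, 100))
nfl <- rbinom(1200, 1, ifelse(salary < 61, 0.75, 0.25))

fit <- glm(nfl ~ salary, family = binomial(link = "logit"))

# predicted probabilities for salaries of 40 and 80 (thousand dollars)
newdata <- data.frame(salary = c(40, 80))
probs <- predict(fit, newdata = newdata, type = "response")

# the same thing by hand: inverse logit of alpha + beta * salary
by_hand <- plogis(coef(fit)[1] + coef(fit)[2] * newdata$salary)
probs
by_hand
```

Using predict avoids the rounding error that creeps in when we plug rounded coefficients into the formula by hand.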

I'm leaving out tons of math and theory here. As I said at the outset, there's much more to logistic regression. But what I just demonstrated is the basic idea. With our model we can now draw a line through our plot:

The line represents probability. It decreases as salary increases. People with higher salaries are less likely to say they watch the NFL. At least according to the world of my made up data.
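If you want to draw a plot like that yourself, here is one way to sketch it in base R. As before, the seed is my own assumption, so the curve won't match the original figure exactly. I also jitter the 0/1 responses so the points don't all pile up on two lines.

```r
set.seed(1)  # assumed seed, not part of the original simulation

# simulate the data and fit the model as above
salary <- round(runif(1200, 23, 100))
nfl <- rbinom(1200, 1, ifelse(salary < 61, 0.75, 0.25))
fit <- glm(nfl ~ salary, family = binomial(link = "logit"))

# scatter plot of the (jittered) 0/1 responses against salary
plot(salary, jitter(nfl, amount = 0.05), pch = 20, col = "grey60",
     xlab = "salary (thousands of dollars)", ylab = "Pr(watches NFL)")

# overlay the fitted probability curve: inverse logit of alpha + beta * x
curve(plogis(coef(fit)[1] + coef(fit)[2] * x), add = TRUE, lwd = 2)
```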