Most useful tests for an ANCOVA model

In his book Regression Modeling Strategies, 2nd Ed, Frank Harrell provides a list of what he calls the “most useful tests” for a 2-level factor $$\times$$ numeric model (Table 2.2, p. 19). This is often called an Analysis of Covariance, or ANCOVA. The basic idea is we have a numeric response with substantial variability and we seek to understand the variability by modeling the mean of the response as a function of a categorical variable and a numeric variable.

Let’s simulate some data for such a model and then see how we can use R to carry out these tests.

n <- 400
set.seed(1)
sex <- factor(sample(x = c("f", "m"), size = n, replace = TRUE))
age <- round(runif(n = n, min = 18, max = 65))
y <- 1 + 0.8*age + 0.4*(sex == "m") - 0.7*age*(sex == "m") + rnorm(n, mean = 0, sd = 8)
dat <- data.frame(y, age, sex)


The data contain a numeric response, y, that is a function of age and sex. I set the “true” coefficient values to 1, 0.8, 0.4, and -0.7. They correspond to $$\beta_0$$ through $$\beta_3$$ in the following model:

$y = \beta_0 + \beta_1 age + \beta_2 sex + \beta_3 age \times sex$

In addition the error component is a Normal distribution with a standard deviation of 8.

Now let’s model the data and see how close we get to recovering the true parameter values.

mod <- lm(y ~ age * sex, dat)
summary(mod)

##
## Call:
## lm(formula = y ~ age * sex, data = dat)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -23.8986  -5.8552  -0.2503   6.0507  30.6188
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.27268    1.93776   0.141    0.888
## age          0.79781    0.04316  18.484   <2e-16 ***
## sexm         2.07143    2.84931   0.727    0.468
## age:sexm    -0.72702    0.06462 -11.251   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.661 on 396 degrees of freedom
## Multiple R-squared:  0.7874, Adjusted R-squared:  0.7858
## F-statistic:   489 on 3 and 396 DF,  p-value: < 2.2e-16


While the coefficient estimates for age and the age $$\times$$ sex interaction are pretty close to the true values, the same cannot be said for the intercept and sex coefficients. The residual standard error of 8.661 is close to the true value of 8.

We can see in the summary output of the model that four hypothesis tests, one for each coefficient, are carried out for us. Each are testing if the coefficient is equal to 0. Of those four, only one qualifies as one of the most useful tests: the last one for age:sexm. This tests if the effect of age is independent of sex and vice versa. Stated two other ways, it tests if age and sex are additive, or if the age effect is the same for both sexes. To get a better understanding of what we’re testing, let’s plot the data with fitted age slopes for each sex.

library(ggplot2)
ggplot(dat, aes(x = age, y = y, color = sex)) +
geom_point() +
geom_smooth(method="lm")


Visually it appears the effect of age is not independent of sex. It seems more pronounced for females. Is this effect real or maybe due to chance? The hypothesis test in the summary output for age:sexm evaluates this. Obviously the effect seems very real. We are not likely to see such a difference in slopes this large if there truly was no difference. It does appear the effect of age is different for each sex. The estimate of -0.72 estimates the difference in slopes (or age effect) for the males and females.

The other three hypothesis tests are not very useful.

• Testing if the Intercept is 0 is testing whether y is 0 for females at age 0.
• Testing if age is 0 is testing whether age is associated with y for males.
• Testing if sexm is 0 is testing whether sex is associated with y for subjects at age 0.

Other more useful tests, as Harrell outlines in Table 2.2, are as follows:

• Is age associated with y?
• Is sex associated with y?
• Are either age or sex associated with y?

The last one is answered in the model output. That’s the F-statistic in the last line. It tests whether all coefficients (except the intercept) are equal to 0. The result of this test is conclusive. At least one of the coeffcients is not 0.

To test if age is associated with y, we need to test if both the age and age:sexm coefficents are equal to 0. The car package by John Fox provides a nice function for this purpose called linearHypothesis. It takes at least two arguments. The first is the fitted model object and the second is a vector of hypothesis tests. Below we specify we want to test if “age = 0” and “age:sexm = 0”

library(car)
linearHypothesis(mod, c("age = 0", "age:sexm = 0"))

## Linear hypothesis test
##
## Hypothesis:
## age = 0
## age:sexm = 0
##
## Model 1: restricted model
## Model 2: y ~ age * sex
##
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)
## 1    398 55494
## 2    396 29704  2     25790 171.91 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


The result is once again conclusive. The p-value is virtually 0. It does indeed appear that age is associated with y.

Likewise, to test if sex is associated with y, we need to test if both the sex and age:sexm coefficents are equal to 0.

linearHypothesis(mod, c("sexm = 0", "age:sexm = 0"))

## Linear hypothesis test
##
## Hypothesis:
## sexm = 0
## age:sexm = 0
##
## Model 1: restricted model
## Model 2: y ~ age * sex
##
##   Res.Df    RSS Df Sum of Sq     F    Pr(>F)
## 1    398 119354
## 2    396  29704  2     89651 597.6 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


As expected this test confirms that sex is associated with y, just as we specified when we simulated the data.

Now that we have established that age is associated with y, and that the association differs for each sex, what exactly is that association for each sex? In other words what are the slopes of the lines in our plot above?

We can sort of answer that with the model coefficients.

round(coef(mod),3)

## (Intercept)         age        sexm    age:sexm
##       0.273       0.798       2.071      -0.727


That corresponds to the following model:

$y = 0.273 + 0.799 age + 2.071 sex – 0.727 age \times sex$

When sex is female, the fitted model is

$y = 0.273 + 0.799 age$

This says the slope of the age is about 0.8 when sex is female.

When sex is male, the fitted model is

$y = (0.273 + 2.071) + (0.797 – 0.727) age$
$y = 2.344 + 0.07 age$

This says the slope of the age is about 0.07 when sex is male.

How certain are we about these estimates? That’s what standard error is for. For the age slope estimate for females the standard error is provided in the model output for the age coefficient. It shows about 0.04. Adding and subtracting 2 $$\times$$ 0.04 to the coefficient gives us a rough 95% confidence interval. Or we could just use the confint function:

confint(mod, parm = "age")

##         2.5 %    97.5 %
## age 0.7129564 0.8826672


The standard error of the age slope estimate for males takes a little more work. Another car function useful for this is the deltaMethod function. It takes at least three arguments: the model object, the quantity expressed as a character phrase that we wish to estimate a standard error for, and the names of the parameters. The function then calculates the standard error using the delta method. Here’s one way to do it for our model

deltaMethod(mod, "b1 + b3", parameterNames = paste0("b", 0:3))

##           Estimate         SE       2.5 %    97.5 %
## b1 + b3 0.07079277 0.04808754 -0.02345709 0.1650426


The standard error is similar in magnitude, but since our estimate is so small the resulting confidence interval overlaps 0. This tells us the effect of age on males is too small for our data to determine if the effect is positive or negative.

Another way to get the estimated age slopes for each sex, along with standard errors and confidence intervals, is to use the margins package. We use the margins function with our model object and specify that we want to estimate the marginal effect of age at each level of sex. (“marginal effect of age” is another way of saying the effect of age at each level of sex)

library(margins)
margins(mod, variables = "age", at = list(sex = c("f", "m")))

## Average marginal effects at specified values

## lm(formula = y ~ age * sex, data = dat)

##  at(sex)     age
##        f 0.79781
##        m 0.07079


This does the formula work we did above. It plugs in sex and returns the estmimated slope coefficient for age. If we wrap the call in summary we get the standard errors and confidence intervals.

summary(margins(mod, variables = "age", at = list(sex = c("f", "m"))))

##  factor    sex    AME     SE       z      p   lower  upper
##     age 1.0000 0.7978 0.0432 18.4841 0.0000  0.7132 0.8824
##     age 2.0000 0.0708 0.0481  1.4722 0.1410 -0.0235 0.1650


Using natural splines in linear modeling

Take a look at this scatterplot:

It's clear there is a relationship between x and y, but the relationship is non-linear. How can we fit a linear model to this data? We could try fitting a polynomial model. The relationship seems to “change directions” four different times, so we could try fitting a 4th-degree polynomial model.

modp <- lm(y ~ poly(x, 4))
plot(x,y)
lines(x, fitted(modp))


This sort of captures the general nature of the relationship but the peaks and valleys just aren't quite right. They either over-predict or under-predict. We would like to do better.

Another approach might be to use a nonparametric regression approach such as loess. If we set the span parameter to 0.5, which controls the amount of smoothing, we get a decent fitting model:

modl <- loess(y ~ x, span = 0.5)
plot(x, y)
lines(x, predict(modl))


This matches what we get when we use ggplot with the smooth geom:

library(ggplot2)
ggplot(data.frame(x, y), aes(x, y)) +
geom_point() +
geom_smooth(se = F, span = 0.5)


But the drawback is we have no prediction equation. This is a non-parametric approach, hence no parameters were estimated.

This leads us to restricted cubic splines, or natural splines. The basic idea is to model a non-linear relationship such as the one in our example with piecewise cubic polynomials. Let's go ahead and first use natural splines in our linear model and then talk a little more about what's happening behind the scenes. Below we first load the splines package (a recommended package that comes with base R) so we have access to the ns function (natural splines). Notice we call ns on our predictor and specify df as 4. Specifying df = 4 implies 3 interior knots (ie, not including two “boundary knots”).

library(splines)
modns <- lm(y ~ ns(x, df = 4))
plot(x, y)
lines(x, predict(modns))


This looks better than the polynomial model. And unlike the loess fit, we have a prediction equation:

summary(modns)

##
## Call:
## lm(formula = y ~ ns(x, df = 4))
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -6.355 -1.302 -0.052  1.279  5.325
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)      0.6489     0.8242   0.787    0.433
## ns(x, df = 4)1  11.9996     1.0508  11.420   <2e-16 ***
## ns(x, df = 4)2  50.1753     1.0438  48.071   <2e-16 ***
## ns(x, df = 4)3  72.0807     2.1079  34.195   <2e-16 ***
## ns(x, df = 4)4  19.5918     0.9794  20.004   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.114 on 95 degrees of freedom
## Multiple R-squared:  0.977,  Adjusted R-squared:  0.976
## F-statistic:  1008 on 4 and 95 DF,  p-value: < 2.2e-16


Obviously the coefficients defy interpretation, but we can work with them as we would any other linear model. For example, we can test for linearity, that is $$H_0: \beta_2 = \beta_3 = \beta_4 = 0$$. In our model that means testing that the last 3 coefficients are equal to 0. The car package provides the powerful linearHypothesis function for this purpose.

library(car)
linearHypothesis(modns, names(coef(modns))[3:5])

## Linear hypothesis test
##
## Hypothesis:
## ns(x, df = 4)2 = 0
## ns(x, df = 4)3 = 0
## ns(x, df = 4)4 = 0
##
## Model 1: restricted model
## Model 2: y ~ ns(x, df = 4)
##
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)
## 1     98 18311.8
## 2     95   424.4  3     17887 1334.7 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


It's no surprise that the test is highly significant. We can also verify our model with natural splines is superior to the polynomial model via AIC. (Recall a lower AIC is better.)

AIC(modp, modns)

##       df      AIC
## modp   6 521.2002
## modns  6 440.3375


Now that we have played with natural splines, let's back up and try to get a better understanding of what's going on. To begin with, here's how we generated the data:

x <- 1:100         # independent variable
k <- c(25, 50, 75) # 3 interior knots

# function to construct variables x2, x3, x4
u <- function(x)ifelse(x > 0, x, 0)

x2 <- u(x - k[1])
x3 <- u(x - k[2])
x4 <- u(x - k[3])

# generate data
set.seed(1)
y <- 0.8 + 1*x + -1.2*x2 + 1.4*x3 + -1.6*x4 + rnorm(100,sd = 2.2)
plot(x, y)


Our first predictor is x, which is simply the numbers 1 – 100. Next we define 3 “knots” at 25, 50, and 75. We then use those knots to construct three additional variables: x2, x3, and x4.

• x2 is equal to x – 25 when x – 25 is greater than 0
• x3 is equal to x – 50 when x – 50 is greater than 0
• x4 is equal to x – 75 when x – 75 is greater than 0

Finally we generate our dependent variable, y, as a function of x, x2, x3, and x4 plus some noise from a Normal(0, 2.2) distribution. The formula on the right side of the assignment operator is our True model. This is technically called a linear spline function. The best linear model we could fit to the data would be the following:

mod <- lm(y ~ x + x2 + x3 + x4)
summary(mod)

##
## Call:
## lm(formula = y ~ x + x2 + x3 + x4)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -5.0699 -1.2797  0.1007  1.2322  4.9155
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.34673    0.77485   1.738   0.0854 .
## x            0.97506    0.04425  22.035   <2e-16 ***
## x2          -1.14818    0.07094 -16.186   <2e-16 ***
## x3           1.35246    0.06220  21.743   <2e-16 ***
## x4          -1.57494    0.06865 -22.941   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.01 on 95 degrees of freedom
## Multiple R-squared:  0.9792, Adjusted R-squared:  0.9783
## F-statistic:  1117 on 4 and 95 DF,  p-value: < 2.2e-16

plot(x, y)
lines(x, fitted(mod))


So we see that to make the trajectory of our data change directions 4 times, we needed to create 4 predictors, 3 of which were based on one. This might make intuitive sense if we think of a simple parabola. It changes directions twice and has two coefficients for the $$x$$ and $$x^2$$ parameters. It's a 2nd degree polynomial. Likewise for a 3rd degree polynomial, a 4th degree polynomial, and so forth. There's only one $$x$$, but the trajectory of $$y$$ changes depending on the degree of the polynomial.

The takeaway is that when we see that our response variable has a non-linear relationship with a predictor variable in real life, we may need to consider more than just a single slope coefficient to model the relationship. In the example we've been using the slope, or trajectory, changes directions 4 times, which suggests using four predictors instead of one. We tried a 4th degree polynomial of $$x$$ but saw that didn't work as well as we would have liked. A recommended approach then is to try natural splines.

The basic, and I mean very basic, idea of natural splines is to fit a 3rd degree polynomial to data within knots, and then connect those lines together. For example, below is our data with knots defined at 0, 25, 50, 75, and 100.

plot(x,y)
abline(v = c(0,25,50,75,100))


With 5 knots, we have 4 regions of data. Within those 4 regions of data, natural splines essentially allow us to fit 4 different 3rd degree polynomials, all smoothed together. The magic is in how the 3 additional predictors are generated. Using the ns function in the splines package, we can create a basis matrix that allows us to fit a natural cubic spline using regular regression functions such as lm and glm. How the basis matrix is generated is quite complicated and probably something you'll just want to take on faith, like I do.

We can sort of see the natural spline in action if we fit and then color the lines between the knots. Below we regress $$y$$ on a natural spline of $$x$$ with knots defined at 25, 50 and 70. We then color the fitted lines differently between the knots.

plot(x,y)
mod.ns <- lm(y ~ ns(x, knots = c(25,50,75)))
lines(x[1:25], fitted(mod.ns)[1:25],col=1)
lines(x[26:50], fitted(mod.ns)[26:50],col=2)
lines(x[51:75], fitted(mod.ns)[51:75],col=3)
lines(x[76:100], fitted(mod.ns)[76:100],col=4)


Notice the red and green interior lines have a noticeable 3rd degree polynomial shape. The exterior red and blue lines look more like a 2nd degree polynomial. That's because natural splines are constrained to be linear in the tails (ie, the boundary knots). For this reason, natural splines are sometimes called restricted cubic splines. If we didn't want that constraint, we could use the bs function to generate a b-spline matrix basis. Here's what that looks like.

plot(x,y)
mod.bs <- lm(y ~ bs(x, knots = c(25,50,75)))
lines(x[1:25], fitted(mod.bs)[1:25],col=1)
lines(x[26:50], fitted(mod.bs)[26:50],col=2)
lines(x[51:75], fitted(mod.bs)[51:75],col=3)
lines(x[76:100], fitted(mod.bs)[76:100],col=4)


Notice the blue line in particular curls up at the boundary as would be expected of a 3rd degree polynomial. But that's not the only difference. Look at the summary output. Notice the bs function generated matrix with 6 columns instead of 4.

summary(mod.bs)

##
## Call:
## lm(formula = y ~ bs(x, knots = c(25, 50, 75)))
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -5.8247 -1.1582 -0.0468  1.2780  5.0283
##
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)
## (Intercept)                      1.983      1.218   1.628  0.10699
## bs(x, knots = c(25, 50, 75))1    6.623      2.281   2.903  0.00461 **
## bs(x, knots = c(25, 50, 75))2   33.328      1.529  21.794  < 2e-16 ***
## bs(x, knots = c(25, 50, 75))3    8.281      1.840   4.501 1.96e-05 ***
## bs(x, knots = c(25, 50, 75))4   60.284      1.722  35.000  < 2e-16 ***
## bs(x, knots = c(25, 50, 75))5   40.697      1.913  21.278  < 2e-16 ***
## bs(x, knots = c(25, 50, 75))6   38.377      1.688  22.736  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.052 on 93 degrees of freedom
## Multiple R-squared:  0.9788, Adjusted R-squared:  0.9774
## F-statistic: 714.2 on 6 and 93 DF,  p-value: < 2.2e-16


So fitting a b-spline means actually fitting a more complicated model. For this reason and the fact that b-splines can be poory behaved in the tails (Harrell, p. 24), most statisticians recommend working with natural splines.

To wrap up, let's go over some guidelines for using natural splines with real data. First off, you don't want to go looking at your data and guessing how many times it changes direction to determine your knots or degrees of freedom! I did that simply as a way to help explain how splines worked. In practice you'll want to determine degrees of freedom based on sample size and how important you suspect a predictor to be. Harrell states that 4 degrees of freedom is usually sufficient. If your sample is on the small side, perhaps choose 3 degrees of freedom. If it's large, go with 5. And notice we're talking about degrees of freedom, not knots. The location of knots isn't that crucial and ns will automatically select knots based on the quantiles of the predcitor.

Something else to remember is that the coefficients on a model with natural splines defy any sort of interpretation. So forget using the “1-unit increase in x leads to a __ increase in y” method to explain association. An alternative approach is an effect plot, which allows you to visualize your model given certain predictor values. Here's a quick demonstration using a simplified example that comes with the powerful effects package. Below we model the log of prestige, a prestige score for someone's occupation, as a function of logged income and a 4 degree of freedom natural spline basis of education. Calling summary and anova on the model object will reveal the natural spline appears warranted and highly significant.

library(effects)
mod.pres1 <- lm(log(prestige) ~ log(income) + ns(education, 4),
data=Prestige)


But what does it mean? What is the association between prestige and education when, say, holding income at the mean value? An effect plot sheds some light.

eff.log <- Effect("education", mod.pres1, transformation=list(inverse=exp))
plot(eff.log)


This shows that from about 9 – 14, the effect of education is pretty dramatic on prestige scores, but rather uncertain in the extremes, below 9 and above 14. From 10 – 14, it looks like a 2-level increase in education is worth about a 10 point increase in prestige scores. We couldn't guess that from the summary output but we can sort of infer it from the effect plot. Again, this is just an example and not a replication of the original analysis.

Reference:

F. Harrell. Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. 2nd Ed Springer. 2015

Profile likelihood ratio confidence intervals

When you fit a generalized linear model (GLM) in R and call confint on the model object, you get confidence intervals for the model coefficients. But you also get an interesting message:

Waiting for profiling to be done...

What's that all about? What exactly is being profiled? Put simply, it's telling you that it's calculating a profile likelihood ratio confidence interval.

The typical way to calculate a 95% confidence interval is to multiply the standard error of an estimate by some normal quantile such as 1.96 and add/subtract that product to/from the estimate to get an interval. In the context of GLMs, we sometimes call that a Wald confidence interval.

Another way to determine an upper and lower bound of plausible values for a model coefficient is to find the minimum and maximum value of the set of all coefficients that satisfy the following:

$-2\log\left(\frac{L(\beta_{0}, \beta_{1}|y_{1},…,y_{n})}{L(\hat{\beta_{0}}, \hat{\beta_{1}}|y_{1},…,y_{n})}\right) < \chi_{1,1-\alpha}^{2}$

Inside the parentheses is a ratio of likelihoods. In the denominator is the likelihood of the model we fit. In the numerator is the likelihood of the same model but with different coefficients. (More on that in a moment.) We take the log of the ratio and multiply by -2. This gives us a likelihood ratio test (LRT) statistic. This statistic is typically used to test whether a coefficient is equal to some value, such as 0, with the null likelihood in the numerator (model without coefficient, that is, equal to 0) and the alternative or estimated likelihood in the denominator (model with coefficient). If the LRT statistic is less than $$\chi_{1,0.95}^{2} \approx 3.84$$, we fail to reject the null. The coefficient is statisically not much different from 0. That means the likelihood ratio is close to 1. The likelihood of the model without the coefficient is almost as high the model with it. On the other hand, if the ratio is small, that means the likelihood of the model without the coefficient is much smaller than the likelihood of the model with the coefficient. This leads to a larger LRT statistic since it's being log transformed, which leads to a value larger than 3.84 and thus rejection of the null.

Now in the formula above, we are seeking all such coefficients in the numerator that would make it a true statement. You might say we're “profiling” many different null values and their respective LRT test statistics. Do they fit the profile of a plausible coefficient value in our model? The smallest value we can get without violating the condition becomes our lower bound, and likewise with the largest value. When we're done we'll have a range of plausible values for our model coefficient that gives us some indication of the uncertainly of our estimate.

Let's load some data and fit a binomial GLM to illustrate these concepts. The following R code comes from the help page for confint.glm. This is an example from the classic Modern Applied Statistics with S. ldose is a dosing level and sex is self-explanatory. SF is number of successes and failures, where success is number of dead worms. We're interested in learning about the effects of dosing level and sex on number of worms killed. Presumably this worm is a pest of some sort.

# example from Venables and Ripley (2002, pp. 190-2.)
ldose <- rep(0:5, 2)
numdead <- c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16)
sex <- factor(rep(c("M", "F"), c(6, 6)))
budworm.lg <- glm(SF ~ sex + ldose, family = binomial)
summary(budworm.lg)

##
## Call:
## glm(formula = SF ~ sex + ldose, family = binomial)
##
## Deviance Residuals:
##      Min        1Q    Median        3Q       Max
## -1.10540  -0.65343  -0.02225   0.48471   1.42944
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -3.4732     0.4685  -7.413 1.23e-13 ***
## sexM          1.1007     0.3558   3.093  0.00198 **
## ldose         1.0642     0.1311   8.119 4.70e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 124.8756  on 11  degrees of freedom
## Residual deviance:   6.7571  on  9  degrees of freedom
## AIC: 42.867
##
## Number of Fisher Scoring iterations: 4


The coefficient for ldose looks significant. Let's determine a confidence interval for the coefficient using the confint function. We call confint on our model object, budworm.lg and use the parm argument to specify that we only want to do it for ldose:

confint(budworm.lg, parm = "ldose")

## Waiting for profiling to be done...

##     2.5 %    97.5 %
## 0.8228708 1.3390581


We get our “waiting” message though there really was no wait. If we fit a larger model and request multiple confidence intervals, then there might actually be a waiting period of a few seconds. The lower bound is about 0.8 and the upper bound about 1.32. We might say every increase in dosing level increase the log odds of killing worms by at least 0.8. We could also exponentiate to get a CI for an odds ratio estimate:

exp(confint(budworm.lg, parm = "ldose"))

## Waiting for profiling to be done...

##    2.5 %   97.5 %
## 2.277027 3.815448


The odds of “success” (killing worms) is at least 2.3 times higher at one dosing level versus the next lower dosing level.

To better understand the profile likelihood ratio confidence interval, let's do it “manually”. Recall the denominator in the formula above was the likelihood of our fitted model. We can extract that with the logLik function:

den <- logLik(budworm.lg)
den

## 'log Lik.' -18.43373 (df=3)


The numerator was the likelihood of a model with a different coefficient. Here's the likelihood of a model with a coefficient of 1.05:

num <- logLik(glm(SF ~ sex + offset(1.05*ldose), family = binomial))
num

## 'log Lik.' -18.43965 (df=2)


Notice we used the offset function. That allows us to fix the coefficient to 1.05 and not have it estimated.

Since we already extracted the log likelihoods, we need to subtract them. Remember this rule from algebra?

$\log\frac{M}{N} = \log M – \log N$

So we subtract the denominator from the numerator, multiply by -2, and check if it's less than 3.84, which we calculate with qchisq(p = 0.95, df = 1)

-2*(num - den)

## 'log Lik.' 0.01184421 (df=2)

-2*(num - den) < qchisq(p = 0.95, df = 1)

## [1] TRUE


It is. 1.05 seems like a plausible value for the ldose coefficient. That makes sense since the estimated value was 1.0642. Let's try it with a larger value, like 1.5:

num <- logLik(glm(SF ~ sex + offset(1.5*ldose), family = binomial))
-2*(num - den) < qchisq(p = 0.95, df = 1)

## [1] FALSE


FALSE. 1.5 seems too big to be a plausible value for the ldose coefficient.

Now that we have the general idea, we can program a while loop to check different values until we exceed our threshold of 3.84.

cf <- budworm.lg$coefficients[3] # fitted coefficient 1.0642 cut <- qchisq(p = 0.95, df = 1) # about 3.84 e <- 0.001 # increment to add to coefficient LR <- 0 # to kick start our while loop while(LR < cut){ cf <- cf + e num <- logLik(glm(SF ~ sex + offset(cf*ldose), family = binomial)) LR <- -2*(num - den) } (upper <- cf)  ## ldose ## 1.339214  To begin we save the original coefficient to cf, store the cutoff value to cut, define our increment of 0.001 as e, and set LR to an initial value of 0. In the loop we increment our coefficient estimate which is used in the offset function in the estimation step. There we extract the log likelihood and then calculate LR. If LR is less than cut (3.84), the loop starts again with a new coefficient that is 0.001 higher. We see that our upper bound of 1.339214 is very close to what we got above using confint (1.3390581). If we set e to smaller values we'll get closer. We can find the LR profile lower bound in a similar way. Instead of adding the increment we subtract it: cf <- budworm.lg$coefficients[3]  # reset cf
LR <- 0                           # reset LR
while(LR < cut){
cf <- cf - e
num <- logLik(glm(SF ~ sex + offset(cf*ldose), family = binomial))
LR <- -2*(num - den)
}
(lower <- cf)

##    ldose
## 0.822214


The result, 0.822214, is very close to the lower bound we got from confint (0.8228708).

This is a very basic implementation of calculating a likelihood ratio confidence interval. It is only meant to give a general sense of what's happening when you see that message Waiting for profiling to be done.... I hope you found it helpful. To see how R does it, enter getAnywhere(profile.glm) in the console and inspect the code. It's not for the faint of heart.

I have to mention the book Analysis of Categorical Data with R, from which I gained a better understanding of the material in this post. The authors have kindly shared their R code at the following web site if you want to have a look: http://www.chrisbilder.com/categorical/

To see how they “manually” calculate likelihood ratio confidence intervals, go to the following R script and see the section “Examples of how to find profile likelihood ratio intervals without confint()”: http://www.chrisbilder.com/categorical/Chapter2/Placekick.R

Poisson regression – Ch 6 of Gelman and Hill

Chapter 6 of Gelman and Hill's Data Analysis Using Regression and Multilevel/Hierarchical Models presents an interesting example of Poisson regression using data on police stops in New York. Put simply, they seek to model the count of “stop and frisks” as a function of ethnicity and precinct with number of arrests in the previous year included as an offset. While the example is fun and informative, it's not terribly easy to replicate. The R code is provided to run the regressions but the data is a little hard to locate. I was eventually able to find it and get everything to mostly work, so I thought I would blog about it in case anyone else was trying to do the same.

About the data: it turns out you can find it here in the “police” folder. In the folder is a dat file called “frisk_with_noise”. Here's one way to read it in:

url <- "http://www.stat.columbia.edu/~gelman/arm/examples/police/frisk_with_noise.dat"

##   stops  pop past.arrests precinct eth crime
## 1    75 1720          191        1   1     1
## 2    36 1720           57        1   1     2
## 3    74 1720          599        1   1     3
## 4    17 1720          133        1   1     4

dim(dat)

## [1] 900   6


But wait. You'll notice the data set has 900 rows but the regression output in the book references n = 225. To replicate the example in the book, I'm pretty sure we need to aggregate over precinct and eth, like so:

stops <- aggregate(cbind(stops, past.arrests) ~ eth + precinct, data=dat, sum)


Using cbind() on stops and past.arrests allows us to sum both stops and past.arrests over all combinations of eth and precinct. Now we're ready to run the code provided in the book, which can also be found here. Here's one of the models they fit:

library(arm) # for the display() function
fit.2 <- glm (stops ~ factor(eth), data=stops, family=poisson, offset=log(past.arrests))
display(fit.2)

## glm(formula = stops ~ factor(eth), family = poisson, data = stops,
##     offset = log(past.arrests))
##              coef.est coef.se
## (Intercept)  -0.59     0.00
## factor(eth)2  0.07     0.01
## factor(eth)3 -0.16     0.01
## ---
##   n = 225, k = 3
##   residual deviance = 45437.4, null deviance = 46120.3 (difference = 682.9)


That works, but our output doesn't match the book's output! What's going on? Actually the explanation is in the title of the data file: “frisk_with_noise”. Also notice how I skipped the first 6 lines of the file when reading it in to R. Here's the first line of what I skipped:

stop and frisk data (with noise added to protect confidentiality)


So noise was apparently added after publication of the book to protect confidentiality. I guess we're not supposed to see which precincts are making the most stops of certain ethnicities? Oh well, at least we can run the code now.

We can see fitted values using the fitted() function. Here are the first 10 fitted values compared to their observed values:

cbind(observed=stops$stops[1:10], fitted=fitted(fit.2)[1:10])  ## observed fitted ## 1 202 544.2816 ## 2 102 175.7562 ## 3 81 180.0316 ## 4 132 418.2082 ## 5 144 331.8515 ## 6 71 203.6578 ## 7 752 1215.1920 ## 8 441 373.5563 ## 9 410 584.9845 ## 10 385 261.5884  Doesn't look too good to the eye, at least not the first 10. How are those fitted values calculated? Like this: $y = exp(\alpha + \beta x + log(t))$ where t is the offset, in this case, past arrrests. So to calculate the fitted value for precinct 1, ethnicity = 1 (black), and 980 arrests in the previous year, we do the following: exp(coef(fit.2)[1] + log(980))  ## (Intercept) ## 544.2816  # or equivalently 980*exp(coef(fit.2)[1])  ## (Intercept) ## 544.2816  # compare to R's fitted value fitted(fit.2)[1] == exp(coef(fit.2)[1] + log(980))  ## 1 ## TRUE  Plotting standardized residuals versus fitted values can reveal overdispersion. That's what Gelman and Hill do in figure 6.1. Below I reproduce it. First I fit a new model that includes precinct as a predictor. The figure on the left plots raw residuals versus fitted values. This is of limited use since we expect variance to increase with larger fitted values in a Poisson distribution. In other words we wouldn't expect to see a constant scatter about 0 as we would in OLS regression. The figure on the right, however, uses standardized residuals which have a mean of 0 and standard deviation of 1. If the Poisson model is true, we would expect to see most residuals falling within 2 standard deviations from 0. Many residuals falling beyond this range reveal overdispersion. And indeed that's what we see here. fit.3 <- glm (stops ~ factor(eth) + factor(precinct), data=stops, family=poisson, offset=log(past.arrests)) par(mfrow=c(1,2)) pv <- fitted(fit.3) r <- (stops$stops - fitted(fit.3))
plot(pv, r, pch=20, ylab="raw residuals", xlab="predicted value")
abline(h=0)
sr <- (stops$stops - fitted(fit.3))/sqrt(fitted(fit.3)) plot(pv, sr, pch=20, ylab="standardized residuals", xlab="predicted value") abline(h=c(-2,0,2),lty=c(2,1,2))  par(mfrow=c(1,1))  To correct for overdisperson in R you can repeat the code for fit.3 above with family = quasipoisson. This corrects the standard errors of the regression coefficients by making them larger. Partial Residuals and the termplot function Earlier this year I taught an Intro to R Graphics workshop. While preparing for it I came across the termplot function. I had no idea it existed. Of course now that I know about it, I come across it fairly regularly. (That's the Baader-Meinhof Phenomenon for you.) ?termplot will tell you what it does: “Plots regression terms against their predictors, optionally with standard errors and partial residuals added.” If you read on, you'll see “Nothing sensible happens for interaction terms, and they may cause errors.” Duly noted. Let's try this out and see what it does. First we'll fit a model using the mtcars dataset that comes with R. Below mpg = Miles/(US) gallon, drat = Rear axle ratio, qsec = ¼ mile time, disp = Displacement (cu.in.), and wt = Weight (lb/1000). lm.cars <- lm(mpg ~ wt + disp + drat + qsec, data=mtcars) summary(lm.cars)  ## ## Call: ## lm(formula = mpg ~ wt + disp + drat + qsec, data = mtcars) ## ## Residuals: ## Min 1Q Median 3Q Max ## -4.143 -1.933 -0.149 0.919 5.541 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 9.03223 9.58829 0.94 0.35454 ## wt -4.86768 1.20872 -4.03 0.00041 *** ## disp 0.00522 0.01105 0.47 0.64021 ## drat 1.87086 1.32462 1.41 0.16926 ## qsec 1.05248 0.34778 3.03 0.00539 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 2.6 on 27 degrees of freedom ## Multiple R-squared: 0.838, Adjusted R-squared: 0.814 ## F-statistic: 35 on 4 and 27 DF, p-value: 2.55e-10  par(mfrow=c(2,2)) termplot(lm.cars)  Compare the coefficients in the summary output to the slopes of the lines in the graphs. See how the coefficients match the slopes of the lines? That's because they are the slopes of the lines. For example, wt has a coefficient of about -5. Likewise the termplot for that coefficient has a negative slope that appears to be about -5. Using these plots we can see at a glance the respective contributions of the predictors. wt and qsec seem to contribute to the model due to the steepness of their slopes, but disp and drat not so much. This matches the summary output that shows wt and qsec as being significant. Now termplot also has an argument called partial.resid that will add partial residuals if set to TRUE. Let's try that out, with color set to “purple” since the default gray can be hard to see (at least to my eyes): par(mfrow=c(2,2)) termplot(lm.cars, partial.resid=TRUE, col.res = "purple")  How are these points generated? The x-axis is clear enough. That's just the predictor values. But the y-axis says “Partial for [variable name]”, i.e. partial residuals. What does that mean? Put as simply as possible, partial residuals are the sum of the residuals and predictor terms. Hopefully you know what residuals are. Those are just the difference between the observed and fitted response values. “Predictor terms” is a little trickier. To understand them, it helps to know that $y = \bar{y} + b_{1}(x_{1} – \bar{x}_{1}) + b_{2}(x_{2} – \bar{x}_{2}) + b_{3}(x_{3} – \bar{x}_{3}) + b_{4}(x_{4} – \bar{x}_{4}) + e$ is another way of writing $y = b_{0} + b_{1}x_{1} + b_{2}x_{2}+ b_{3}x_{3}+ b_{4}x_{4} + e$ So the first equation above has $$\bar{y}$$ as the intercept and centered predictors. If we center the predictors and fit a model, we'll see that's precisely what we get: # center the variables mtcars <- transform(mtcars, wtC = wt-mean(wt), dispC = disp-mean(disp), dratC=drat-mean(drat), qsecC=qsec-mean(qsec)) # now fit model using centered predictors lm.terms.cars <- lm(mpg ~ wtC + dispC + dratC + qsecC, data=mtcars) coef(lm.terms.cars)[1]  ## (Intercept) ## 20.09  mean(mtcars$mpg)

## [1] 20.09


Now if we take the centered predictor values and multiply them by their respective coefficients, we get what are referred to in R as terms. For example, the terms for the first record in the data set are as follows:

# take all coefficients but the intercept...
coef(lm.terms.cars)[-1]

##       wtC     dispC     dratC     qsecC
## -4.867682  0.005222  1.870855  1.052478

# ...and first record of data...
lm.terms.cars$model[1,-1]  ## wtC dispC dratC qsecC ## Mazda RX4 -0.5972 -70.72 0.3034 -1.389  # ...and multiply: coef(lm.terms.cars)[-1]*t(lm.terms.cars$model[1,-1])

##       Mazda RX4
## wtC      2.9072
## dispC   -0.3693
## dratC    0.5677
## qsecC   -1.4616


So for the first record, the term due to wt is 2.91, the term due to disp is -0.37, and so on. These are the predictor terms I mentioned earlier. These are what we add to the residuals to make partial residuals.

You'll be happy to know we don't have to center predictors and fit a new model to extract terms. R can do that for us on the original model object when you specify type="terms" with the predict function, like so:

# notice we're calling this on the original model object, lm.cars
pterms <- predict(lm.cars, type="terms")
pterms[1,] # compare to above; the same

##      wt    disp    drat    qsec
##  2.9072 -0.3693  0.5677 -1.4616


To calculate the partial residuals ourselves and make our own partial residual plots, we can do this:

# add residuals to each column of pterms (i.e., the terms for each predictor)
partial.residuals <- apply(pterms,2,function(x)x+resid(lm.cars))
# create our own termplot of partial residuals
par(mfrow=c(2,2))
# the model in lm.cars includes the response in first column, so we index with i + 1
for(i in 1:4){
plot(x=lm.cars$model[,(i+1)],y=partial.residuals[,i], col="purple", ylim=c(-10,10)) abline(lm(partial.residuals[,i] ~ lm.cars$model[,(i+1)]), col="red")
}


So that's what termplot does for us: it takes the terms for each predictor, adds the residuals to the terms to create partial residuals, and then plots partial residuals versus their respective predictor, (if you specify partial.resid=TRUE). And of course it plots a fitted line, the result of regressing the predictor's partial residuals on itself. Recall ealier I mentioned the slope of the line in the termplot graphs is the coefficient in the summary output. We can indeed verify that:

# coefficient for wt
coef(lm.cars)[2]

##     wt
## -4.868

coef(lm(partial.residuals[,1] ~ mtcars$wt))[2]  ## mtcars$wt
##    -4.868


Ultimately what this whole partial residual/termplot thing is doing is splitting the response value into different parts: an overall mean, a term that is due to wt, a term that is due to disp, a term that is due to drat, and a term that is due to qsec. And you have the residual. So when we create the partial residual for wt, what we're doing is adding the wt term and the residual. This sum accounts for the part of the response not explained by the other terms.

Again let's look at the overall mean, the terms for the first observation and its residual:

# overall mean of response
mean(mtcars$mpg)  ## [1] 20.09  # terms of first observation pterms[1,]  ## wt disp drat qsec ## 2.9072 -0.3693 0.5677 -1.4616  # the residual of the first observation resid(lm.cars)[1]  ## Mazda RX4 ## -0.7346  If we add those up, we get the observed value # add overall mean, terms and residual for first observation mean(mtcars$mpg) + sum(pterms[1,]) + resid(lm.cars)[1]

## Mazda RX4
##        21

# same as observed in data
mtcars$mpg[1]  ## [1] 21  So the wt term value of 2.9072 represents the part of the response that is not explained by the other terms. Finally, we can add another argument to termplot, smooth=panel.smooth that will draw a smooth lowess line through the points. This can help us assess departures from linearity. par(mfrow=c(2,2)) termplot(lm.cars, partial.resid=TRUE, col.res = "purple", smooth=panel.smooth)  Notice how the dashed line bends around the straight line for wt. Perhaps wt requires a quadratic term? Let's try that using the poly function, which creates orthogonal polynomials. lm.cars2 <- lm(mpg ~ poly(wt,2) + disp + drat + qsec, data=mtcars) par(mfrow=c(2,2)) termplot(lm.cars2, partial.resid=TRUE, col.res = "purple", smooth=panel.smooth)  We now see a better fit with the regressed line matching closely to the smooth line. Simulating responses from a linear model Say you fit a model using R's lm() function. What you get in return are coefficients that represent an estimate of the linear function that gave rise to the data. The assumption is the response of the model is normally distributed with a mean equal to the linear function and a standard deviation equal to the standard deviation of the residuals. Using notation we express this as follows for a simple linear model with intercept and slope coefficients: $Y_{i} \sim N(\beta_{0} + \beta_{1}x_{i},\sigma)$ The implication of this in the world of statistical computing is that we can simulate responses given values of x. R makes this wonderfully easy with its simulate() function. Simply pass it the model object and the number of simulations you want and it returns a matrix of simulated responses. A quick example: set.seed(1) # generate data from the linear function y = 3 + 4.2x with random noise x <- seq(12, 23, length = 100) y <- 3 + 4.2 * x + rnorm(100, 0, 5) mod <- lm(y ~ x) summary(mod)  ## ## Call: ## lm(formula = y ~ x) ## ## Residuals: ## Min 1Q Median 3Q Max ## -11.700 -3.029 0.078 2.926 11.487 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3.900 2.504 1.56 0.12 ## x 4.180 0.141 29.70 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 4.51 on 98 degrees of freedom ## Multiple R-squared: 0.9, Adjusted R-squared: 0.899 ## F-statistic: 882 on 1 and 98 DF, p-value: <2e-16  Regressing y on x estimates the coefficients to be about 3.9 and 4.18, pretty close to the true values of 3 and 4.2. It also estimates the standard deviation of the error term to be 4.51, not too far from the 5 we specified in rnorm() when we generated the noise. Now let's use our model to simulate responses (i.e., y values). head(simulate(mod, 5))  ## sim_1 sim_2 sim_3 sim_4 sim_5 ## 1 51.26 55.90 58.09 58.91 54.40 ## 2 54.71 62.14 49.79 63.08 53.18 ## 3 50.87 62.15 63.88 52.26 49.64 ## 4 56.16 53.96 53.72 53.69 55.50 ## 5 52.96 45.60 63.38 54.04 60.39 ## 6 64.35 67.65 63.20 54.68 63.57  The simulate() function returns a data frame. In this case the data frame has 100 rows and 5 columns. (I use the head() function to only show the first six rows.) Each column represents a simulated set of responses from our linear model given our x values. In the first row we have 5 values drawn from a Normal distribution with mean $$3.89 + 4.17x_{1}$$ and a standard deviation of 4.51. In the second row we have a value drawn from a Normal distribution with mean $$3.89 + 4.17x_{2}$$ and a standard deviation of 4.51. And so on. We could do this “manually” as follows: simResp <- matrix(0, 100, 5) for (i in 1:5) { simResp[, i] <- rnorm(5, c(coef(mod)[1] + coef(mod)[2] * x), rep(summary(mod)$sigma,
5))
}
simResp[1:6, ]

##       [,1]  [,2]  [,3]  [,4]  [,5]
## [1,] 52.52 48.93 54.80 52.14 52.90
## [2,] 61.30 47.77 56.02 53.17 46.46
## [3,] 57.37 53.98 53.25 46.90 63.04
## [4,] 57.90 64.48 49.14 54.33 63.41
## [5,] 55.30 56.91 67.99 54.80 59.03
## [6,] 52.52 48.93 54.80 52.14 52.90


But the simulate() function is much faster and easier to use.

Now 5 simulations is rather puny. A better number would be 1000. Then we can use our simulated y values to generate 1000 linear models, and thus have 1000 coefficient estimates. We can then use those estimates to create kernel density plots that allow us to visualize the amount of uncertainty in our coefficient estimates. Let's do that. Notice how I convert the output of the simulate() function to a matrix. That allows me to feed the 1000 responses to the lm() function in one fell swoop instead of coding a for loop to iterate through each column.

simResp <- data.matrix(simulate(mod, 1000))
modSim <- lm(simResp ~ x)


modSim is the usual linear model object, but with the results of 1000 models instead of 1. Here are the coefficients for the first 5 models:

coef(modSim)[1:2, 1:5]

##               sim_1 sim_2 sim_3 sim_4 sim_5
## (Intercept) -0.5932 3.949 3.861 5.683 5.038
## x            4.3836 4.166 4.199 4.062 4.113


Let's create a scatterplot of these coefficients.

plot(modSim$coefficients[1, ], modSim$coefficients[2, ], xlab = "intercept",
ylab = "slope", main = "Scatterplot of coefficients from simulated models")


We see that the variability of the slope estimates is pretty small (3.8 – 4.6), while the intercept estimates vary widely (from below 0 to 10). This jives with our original model (see output above) where the slope coefficient was signficant with a small standard error while the intercept was not significant and had a relatively large standard error.

Next we create kernel density plots of our coefficient estimates. In addition I add a vertical line to represent the mean of the 1000 coefficient estimates and add the mean itself to the plot using the text() function. I use the difference in the range of the axes to allow me to automatically determine suitable coordinates.

d1 <- density(modSim$coefficients[2, ]) d2 <- density(modSim$coefficients[1, ])
# slope
plot(d1, main = "distribution of slope coefficients")
abline(v = mean(modSim$coefficients[2, ])) text(x = mean(modSim$coefficients[2, ]) + diff(range(d1$x)) * 0.05, y = diff(range(d1$y)) *
0.05, labels = round(mean(modSim$coefficients[2, ]), 2))  # intercept plot(d2, main = "distribution of intercept coefficients") abline(v = mean(modSim$coefficients[1, ]))
text(x = mean(modSim$coefficients[1, ]) + diff(range(d2$x)) * 0.05, y = diff(range(d2$y)) * 0.05, labels = round(mean(modSim$coefficients[1, ]), 2))


If you don't compare the scale of the x-axis in the two plots the coefficients appear to have similar variability. Perhaps a better way to do it is to put both plots in one window and set the limits to the x-axis for both plots to be the range of the estimated intercept coefficients.

par(mfrow = c(2, 1))
# slope
plot(d1, main = "distribution of slope coefficients", xlim = range(d2$x)) abline(v = mean(modSim$coefficients[2, ]))
text(x = mean(modSim$coefficients[2, ]) + diff(range(d2$x)) * 0.05, y = diff(range(d1$y)) * 0.05, labels = round(mean(modSim$coefficients[2, ]), 2))
# intercept
plot(d2, main = "distribution of intercept coefficients", xlim = range(d2$x)) abline(v = mean(modSim$coefficients[1, ]))
text(x = mean(modSim$coefficients[1, ]) + diff(range(d2$x)) * 0.05, y = diff(range(d2$y)) * 0.05, labels = round(mean(modSim$coefficients[1, ]), 2))


par(mfrow = c(1, 1))


Now we see just how uncertain our intercept estimate is compared to the slope estimate.

Finally, why don't we go ahead and plot all 1000 fitted lines over the original data along with our original fitted line and the traditional 95% confidence band we get using our original model.

plot(x, y, col = "red", pch = 19)
abline(mod)  # original fitted line
apply(coef(modSim), 2, abline, col = "gray", lty = 3)

# predicted responses from original model
modOrig <- predict(mod, interval = "confidence")
lines(x, modOrig[, "lwr"], col = "blue")
lines(x, modOrig[, "upr"], col = "blue")


The plotting of the 1000 lines essentially provides us a confidence band for the original fitted line. We see the blue lines are slightly closer to the fitted line than the band produced by our simulated models. That's because the blue lines represent the 95% confidence band. In other words, we would expect 95% of our 1000 estimated lines to fit within those blue lines. The gray lines we see outside of the blue lines are those 5% that don't fall within the 95% confidence band.

Exploring Unordered Contrasts in R

Contrasts in R determine how linear model coefficients of categorical variables are interpreted. The default contrast for unordered categorical variables is the Treatment contrast. This means the “first” level (aka, the baseline) is rolled into the intercept and all subsequent levels have a coefficient that represents their difference from the baseline. That’s not too hard to grasp. But what about other contrasts, namely the Helmert and Sum contrasts? What do they do? Instead of explaining them, I figured I would demonstrate each.

First, let’s make some data:


set.seed(2112)
# create 3 levels, 10 each
flevels <- factor(rep(c("A","B","C"),c(10,10,10)))
# create some "nice" data, sorted so means at each level have good separation
vals <- sort(round(runif(30,3,15)))
# calculate mean of each level for reference
means <- tapply(vals,flevels,mean)
means
A    B    C
6.0 10.1 12.9


Our data consist of three levels of arbitrary values. "flevels" is our categorical variable. Notice I explicitly defined it to be a factor using the factor() function. I need to do this so R knows this variable is a factor and codes it according to whatever contrast setting we decide to use.

Let's verify the default unordered contrast setting is the Treatment contrast:


options("contrasts")
$contrasts unordered ordered "contr.treatment" "contr.poly"  Indeed it is. This means our factor levels are coded as follows:  contrasts(flevels) B C A 0 0 B 1 0 C 0 1  This is a 3 x 2 matrix. The 2 columns of the matrix tells us that our model will have 2 coefficients, one for the B level and one for the C level. Therefore the A level is the baseline. The coefficients we get in our linear model for B and C will indicate the respective differences of their mean from the level A mean. The values in the rows tell us what values to plug into the model to get the means for the row labels. For example, to get the mean for A we plug in 0's for both coefficients which leaves us with the intercept. Therefore the intercept is the mean of A. Let's see all this in action before we explore the Helmert and Sum contrasts.  m.trt <- lm(vals ~ flevels) summary(m.trt) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 6.0000 0.3935 15.249 8.63e-15 *** flevelsB 4.1000 0.5564 7.368 6.32e-08 *** flevelsC 6.9000 0.5564 12.400 1.17e-12 ***  Now we can verify how the Treatment contrast works by extracting the coefficient values from the model and comparing to the means we calculated earlier:  # Intercept = mean of A coef(m.trt)[1] (Intercept) 6 means[1] A 6 # flevelsB = mean of B - mean of A coef(m.trt)[2] flevelsB 4.1 means[2] - means[1] B 4.1 # flevelsC = mean of C - mean of A coef(m.trt)[3] flevelsC 6.9 means[3] - means[1] C 6.9  Let's also verify that plugging in the row values of the contrast matrix returns the means of each level:  # plug in row values into model to get the means of A, B and C, respectively means A B C 6.0 10.1 12.9 # mean of A --> row 1: 0 0 coef(m.trt)[1] + 0*coef(m.trt)[2] + 0*coef(m.trt)[3] (Intercept) 6 # mean of B --> row 2: 0 0 coef(m.trt)[1] + 1*coef(m.trt)[2] + 0*coef(m.trt)[3] (Intercept) 10.1 # mean of C --> row 3: 0 0 coef(m.trt)[1] + 0*coef(m.trt)[2] + 1*coef(m.trt)[3] (Intercept) 12.9  So that's how Treatment contrasts work. Now let's look at Helmert contrasts. "The coefficients for the Helmert regressors compare each level with the average of the "preceding" ones", says Fox in his book An R and S-Plus Companion to Applied Regression. I guess that makes sense. Kind of. Eh, not really. At least not to me. I say we do as we did before: fit a model and compare the coefficients to the means and see what they mean. Before we do that we need to set the contrast to Helmert:  # set contrast to "contr.helmert" contrasts(flevels) <- "contr.helmert" contrasts(flevels) # take a look [,1] [,2] A -1 -1 B 1 -1 C 0 2  Interesting. Notice the column labels are no longer associated with the levels of the factor. They just say 1 and 2. However this still tells us that our model will have two coefficients. Again the row values tell us what to plug in to get the means of A, B and C, respectively. To get the mean of A, we plug in -1 and -1 to the model. This means our intercept has a different interpretation. Let's fit the linear model and investigate.  m.hel <- lm(vals ~ flevels) summary(m.hel) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 9.6667 0.2272 42.553 < 2e-16 *** flevels1 2.0500 0.2782 7.368 6.32e-08 *** flevels2 1.6167 0.1606 10.064 1.24e-10 ***  It turns out the intercept is the mean of the means, the first coefficient is the mean of the first two levels minus the first level, and the second coefficient is the mean of all three levels minus the mean of the first two levels. Did you get that? Here, this may help:  # intercept = mean of all means coef(m.hel)[1] (Intercept) 9.666667 mean(means) [1] 9.666667 # flevels1 = mean of first two levels minus first level coef(m.hel)[2] flevels1 2.05 mean(means[1:2]) - means[1] A 2.05 # flevels2 = mean of all three levels minus mean of first two levels coef(m.hel)[3] flevels2 1.616667 mean(means) - mean(means[1:2]) [1] 1.616667  Let's do that thing again where we plug in the row values of the contrast matrix to verify it returns the means of the levels:  means A B C 6.0 10.1 12.9 # mean of A --> row 1: -1 -1 coef(m.hel)[1] + -1*coef(m.hel)[2] + -1*coef(m.hel)[3] (Intercept) 6 # mean of B --> row 2: 1 -1 coef(m.hel)[1] + 1*coef(m.hel)[2] + -1*coef(m.hel)[3] (Intercept) 10.1 # mean of C --> row 3: 0 2 coef(m.hel)[1] + 0*coef(m.hel)[2] + 2*coef(m.hel)[3] (Intercept) 12.9  That leaves us with the Sum contrast. Regarding models fitted with the Sum contrasts, Fox tells us that "each coefficient compares the corresponding level of the factor to the average of the other levels." I think like Helmert contrasts, this one is better demonstrated. As before we need to change the contrast setting.  # set contrast to "contr.sum" contrasts(flevels) <- "contr.sum" contrasts(flevels) # take a look [,1] [,2] A 1 0 B 0 1 C -1 -1  Just like the Helmert contrast we see two columns with no labels. Our model will have two coefficients that don't correspond directly to the levels of our factors. By now we know the values in the rows are what we plug into our model to get the means of our levels. To get the mean of level A, we plug in 1 and 0. Time to fit the model and investigate:  m.sum <- lm(vals ~ flevels) summary(m.sum) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 9.6667 0.2272 42.553 < 2e-16 *** flevels1 -3.6667 0.3213 -11.413 7.75e-12 *** flevels2 0.4333 0.3213 1.349 0.189  Like the Helmert contrasts, our intercept is mean of all means. But our two coefficients have different interpretations. The first is the mean of all means minus the mean of level 1 (A). The second coefficient is the mean of all means minus the mean of level 2 (B). Notice in the model output above that the second coefficient is not significant. In other words, the mean of level B is not significantly different from the mean of all means.  # intercept = mean of all means coef(m.sum)[1] (Intercept) 9.666667 mean(means) [1] 9.666667 # flevels1 = mean of all means - mean of level 1 (here, A) coef(m.sum)[2] flevels1 -3.666667 means[1] - mean(means) A -3.666667 # flevels2 = mean of all means - mean of level 2 (here, B) coef(m.sum)[3] flevels2 0.4333333 means[2] - mean(means) B 0.4333333  Finally to be complete we plug in the row values of the Sum contrast matrix to verify it returns the means of the factor levels:  means A B C 6.0 10.1 12.9 # mean of A -->row 1: 1 0 coef(m.sum)[1] + 1*coef(m.sum)[2] + 0*coef(m.sum)[3] (Intercept) 6 # mean of B -->row 2: 0 1 coef(m.sum)[1] + 0*coef(m.sum)[2] + 1*coef(m.sum)[3] (Intercept) 10.1 # mean of C -->row 3: -1 -1 coef(m.sum)[1] + -1*coef(m.sum)[2] + -1*coef(m.sum)[3] (Intercept) 12.9  And finally we wrap up this exercise by returning the contrast level of our categorical variable back to the system default:  contrasts(flevels) <- NULL  Hopefully this helps you get a better handle on what Helmert and Sum contrasts do. The Multilevel Model for Change (Ch 3 of ALDA) – revisited In my previous post I talked about how to replicate the example in Chapter 3 of ALDA. It was mostly about finding the dataset and modifying the R code on the UCLA web site so it works. In this post I want to talk a little more about the statistics. Recall the example involves two groups of infants: one assigned to a program to enhance cognitive functioning and the other acting as a control. Cognitive measurements were taken at three different time periods for both groups. Did the group assigned to the program perform differently than the control group? To answer this question the authors postulate a linear model where cognitive test results are explained by time, as follows: $Y = \beta_{0} + \beta_{1}*time$ But the intercept and slope coefficients in that model are modeled as follows: $\beta_{0} = \gamma_{00} + \gamma_{01}*program$ $\beta_{1} = \gamma_{10} + \gamma_{11}*program$ So we have two levels of modeling happening at the same time. The first level concerns within-person change while the second level concerns between-person differences in change. We can consolidate the two levels into one formula, like this: $Y = \gamma_{00} + \gamma_{10}*time + \gamma_{01}*program + \gamma_{11}*time*program$ So we have an intercept and three coefficients. When we fit the model to the data, we get: $Y = 107.84 - 21.13*time + 6.85*program + 5.27*program*time$ All of which are significant coefficients. When program = 0, our linear model is $Y = 107.84 - 21.13*time$. When program = 1, our linear model is $Y = 114.69 - 15.86*time$. The intercept in these models is interpreted as the cognitive score at the first measurement (when the infants were 12 months old). We can see that infants in the program had a higher performance at the first measurement than those not in the program: 114.69 – 107.84 = 6.85. The slope tells us the rate of decline of cognitive performance (decline?). We see the infants in the program had a slower rate of decline over time: -15.86 versus -21.13. Or put another way: -21.13 – -15.86 = -5.27, which is the coefficient of the interaction. That is, the difference in slopes between the two groups is -5.27. Now it’s interesting to note that the model does not appear to make use of the fact that each subject contributed three waves of data. Our dataset has 309 records. But those 309 records are in 103 groups. Those 103 groups are the 103 infants. Each infant contributed three cognitive test scores over time. The dataset has a variable, ID, to indicate those groups. But ID is not in our model. Shouldn’t we make use of that information? Well, in fact we did. Have a look at the R code: lme(cog ~ time*program, data=ch3, random = ~ time | id, method="ML", control=list(opt = "optim"))  Notice how “id” indicates the grouping structure in the “random” argument. Time is specified as the random effect and “| id” indicates it is grouped by “id” (i.e., the 103 infants). In so many words, this allows us to capture the variability in each infant’s own change trajectory. We can think of plotting the cognitive test score for one infant over time and fitting a line to those three points. There will be some error in that line. But not as much error than if we fit a line to all infants in a group over the three times. In this latter scenario we’re not accounting for the grouping of the measurements by infant. We can actually see what happens if we don’t account for this grouping by doing an Analysis of Covariance (ANCOVA). With ANCOVA, we’re basically doing regression with continuous and categorical variables. The usual approach to ANCOVA is to think of doing a regular ANOVA analysis but blocking on a continuous variable. For example, comparing cholesterol levels (y) between a treated group and a reference group adjusted for age (x, in years). We’re interested in the treatment effect but we want to account for the effect of age. We can naively do ANCOVA with the Chapter 3 example from ALDA as follows: lm(cog ~ program + time + time:program, data=ch3)  Look at the results: Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 107.841 1.773 60.822 < 2e-16 *** program 6.855 2.363 2.901 0.00399 ** time -21.133 2.747 -7.694 1.99e-13 *** program:time 5.271 3.660 1.440 0.15087  Now compare those to the multilevel modelling results, obtained from the call to lme() above:  Value Std.Error DF t-value p-value (Intercept) 107.84074 2.048799 204 52.63608 0.0000 time -21.13333 1.903664 204 -11.10140 0.0000 program 6.85466 2.730259 101 2.51063 0.0136 time:program 5.27126 2.536850 204 2.07788 0.0390  Notice the similarities? That's right, both return the same model coefficients! But compare the difference in standard errors. The most dramatic is the interaction between time and program. In the ANCOVA analysis the interaction appears to be insignificant (SE = 3.7; p = 0.15). But in the multilevel model it's significant at the 5% level (SE = 2.5; p = 0.03). We see that the ANCOVA model does not take into account the change trajectories at the individual level, and is thus not sensitive enough to detect the significant difference in rates of cognitive decline between the infants in the program and those in the control group. Simulation to Represent Uncertainty in Regression Coefficients Working through Gelman and Hill’s book can be maddening. The exposition is wonderful but recreating their examples leads to new and exciting levels of frustration. Take the height and earnings example in chapter 4. It’s a simple linear regression of earnings on height. You’d think you could download the data from the book web site, use the code in the book, and reproduce the example. They sure lead you to think that. But it doesn’t work that way. For starters, when you load the data it has 2029 records. However the output of the regression in the book shows n = 1192. So subsetting needs to be done. As far as I can tell, the subsetting is not discussed in the book. Now the author has an “earnings” folder on his site under an “examples” directory, which contains a R file called “earnings_setup.R“. (Never mind the book itself doesn’t appear to mention this directory on the web site.) So this seems to be the place where we find out how to subset the data. The key line of code is ok <- !is.na (earn+height+sex+age) & earn>0 & yearbn>25  which creates a logical vector to subset the data. But when you run it and request the dimensions of the data frame you have 1059 records, not 1192! After trial and error I believe the subset to reproduce the results in the book should be ok <- !is.na (earn+height+sex) & earn>0  That gave me 1192. For the record, here’s my full code: heights <- read.dta ("http://www.stat.columbia.edu/~gelman/arm/examples/earnings/heights.dta") attach(heights) male <- 2 - sex # make male 1 (and therefore female = 0) ok <- !is.na (earn+height+sex) & earn>0 heights.clean <- as.data.frame (cbind (earn, height, sex, male)[ok,]) heights.clean$log.earn <- log(heights.clean$earn)  OK, so now(!) we can reproduce their example: earn.logmodel.3 <- lm (log.earn ~ height + male + height:male, data=heights.clean)  The reason I was interested in this example was for another example in chapter 7 on "using simulation to represent uncertainty in regression coefficients" (p. 142). In other words, instead of using the standard errors and intervals obtained from the predict() function in R, we compute uncertainties by simulation. It seems easy enough using the sim() function from the book's arm package. You do something like the following: library(arm) sim.1 <- sim(earn.logmodel.3, 1000)  The sim function takes two arguments: your model and the number of simulations. It returns a vector of simulated residual standard deviations and a matrix of simulated regression coefficients. We can use these to calculate confidence intervals for coefficients that do not have standard errors in the regression output. Their example asks "what can be said about the coefficient of height among men?" If you look up at the model, you can see it contains an interaction term of height and male. To answer the question we can't just look at the standard error of the height coefficient or just the interaction coefficient. There is no simple way to do what they ask using regression output. So the book instructs us to use the output of the sim function to answer this as follows: height.for.men.coef <- sim.1$beta[,2] + sim.1$beta[,4] quantile(height.for.men.coef, c(0.025,0.975))  Except that doesn't work. More frustration. It produces an error that says "Error in sim.1$beta : $operator not defined for this S4 class" (instead of giving us a 95% interval). With some googling and persistence I was able to determine that the following is how it should be done: height.for.men.coef <- sim.1@coef[,2] + sim.1@coef[,4] quantile(height.for.men.coef, c(0.025,0.975)) 2.5% 97.5% -0.0004378039 0.0507464098  Notice that "@coef" replaces "$beta". And with that I was able to finally reproduce the example I was interested in!

Now about this simulation function. While I appreciate functions that save time and make life easy, I do like to know how they work. Fortunately Gelman and Hill provide pseudo-code in the book. It goes like this:

1. Run your regression to compute the vector $\hat{\beta}$ of estimated parameters, the unscaled estimation covariance matrix $V_{\beta}$, and the residual variance $\hat{\sigma^{2}}$
2. Create n random simulations for the coefficient vector $\beta$ and residual standard deviation. For each simulation draw:
1. Simulate $\sigma = \hat{\sigma}\sqrt{(n - k)/X}$ where X is a random draw from the $\chi^{2}$ distribution with n - k degrees of freedom.
2. Given the random draw of $\sigma$, simulate $\beta$ from a multivariate normal distribution with mean $\hat{\beta}$ and variance matrix $\sigma^{2}V_{\beta}$

Not too bad. Let's use this to manually run our own simulations, so we have an idea of how the sim() function works. (Plus you may not want to use the arm package as it requires loading 9 more packages.)

Step 1 is easy enough. That's just running your regression as you normally would. Next we need to extract our estimated parameters, the unscaled covariance matrix and the residual standard deviation. We also need to snag degrees of freedom for our chi-square random draw. Here's how we can get them:

# extract coefficents
earn.logmodel.3$coef # extract residual standard error from model summary(earn.logmodel.3)$sigma

# extract unscaled covariance matrix
summary(earn.logmodel.3)$cov.unscaled # extract k and n - k; first two elements in vector summary(earn.logmodel.3)$df


Let's use this information to do a single simulation.

s.hat <- summary(earn.logmodel.3)$sigma n.minus.k <- summary(earn.logmodel.3)$df[2]
library(MASS) # need for mvrnorm function to simulate draws from multivariate normal dist'n
# simulate residual standard deviation
(s.hat.sim <- s.hat*sqrt(n.minus.k/rchisq(1, n.minus.k)))
[1] 0.8591814
# use simulated residual standard deviation to simulate regression coefficients
mvrnorm(1,earn.logmodel.3$coef, s.hat.sim^2*summary(earn.logmodel.3)$cov.unscaled)
(Intercept)       height         male  height:male
7.906605029  0.025163124 -0.160828921  0.007904422


That seems to work. How about we try doing 1000 simulations. Here's one way:

n <- 1000
sim.2.sigma <- rep(NA,n)
sim.2.coef <- matrix(NA,n,4)
for (i in 1:n){
sim.2.sigma[i] <- s.hat*sqrt(n.minus.k/rchisq(1, n.minus.k))
sim.2.coef[i,] <- mvrnorm(1,earn.logmodel.3$coef,sim.2.sigma[i]^2*summary(earn.logmodel.3)$cov.unscaled)
}


Now let's see how our simulation results compare to what we got using the sim() function:

height.for.men.coef.2 <- sim.2.coef[,2] + sim.2.coef[,4]
quantile(height.for.men.coef.2, c(0.025,0.975))
2.5%        97.5%
-0.001828216  0.049499381


Looks similar to what we got above. Nice. It's probably better to just use the sim() function for this sort of thing, but at least now we know a little more about what it's doing.

A Logistic Regression Checklist

I recently read The Checklist Manifesto by Atul Gawande and was fascinated by how relatively simple checklists can improve performance and results in such complex endeavors as surgery or flying a commercial airplane. I decided I wanted to make a checklist of my own for Logistic regression. It ended up not being a checklist on how to do it per se, but rather a list of important facts to remember. Here’s what I came up with.

• Logistic regression models the probability that y = 1, $P(y_{i} = 1) = logit^{-1}(X_{i}\beta)$ where $logit^{-1} = \frac{e^{x}}{1+e^{x}}$
• Logistic predictions are probabilistic. It predicts a probability that y = 1. It does not make a point prediction.
• The function $logit^{-1} = \frac{e^{x}}{1+e^{x}}$ transforms continuous values to the range (0,1).
• Dividing a regression coefficient by 4 will give an upper bound of the predictive difference corresponding to a unit difference in x. For example if $\beta = 0.33$, then $0.33/4 = 0.08$. This means a unit increase in x corresponds to no more than a 8% positive difference in the probability that y = 1.
• The odds of success (i.e., y = 1) increase multiplicatively by $e^{\beta}$ for every one-unit increase in x. That is, exponentiating logistic regression coefficients can be interpreted as odds ratios. For example, let’s say we have a regression coefficient of 0.497. Exponentiating gives $e^{0.497} = 1.64$. That means the odds of success increase by 64% for each one-unit increase in x. Recall that odds = $\frac{p}{1-p}$. If our predicted probability at x is 0.674, then the odds of success are $\frac{0.674}{0.326} = 2.07$. Therefore at x + 1, the odds will increase by 64% from 2.07 to $2.07(1.64) = 3.40$. Notice that $1.64 = \frac{3.40}{2.70}$, which is an odds ratio. The ratio of odds of x + 1 to x will always be $e^{\beta}$, where $\beta$ is a logistic regression coefficient.
• Plots of raw residuals from logistic regression are generally not useful. Instead it’s preferable to plot binned residuals “by dividing the data into categories (bins) based on their fitted values, and then plotting the average residual versus the average fitted value for each bin.” (Gelman & Hill, p. 97). Example R code for doing this can be found here.
• The error rate is the proportion of cases in your model that predicts y = 1 when the case is actually y = 0 in the data. We predict y = 1 when the predicted probability exceeds 0.5. Otherwise we predict y = 0. It’s not good if your error rate equals the null rate. The null rate is usually the proportion of 0’s in your data. In other words, if you guessed all cases in your data are y = 1, then the null rate is the percentage you guessed wrong. Let’s say your data has 58% of y = 1 and 42% of y = 0, then the null rate is 42%. Further, let’s say you do some logistic regression on this data and your model has an error rate of 36%. That is, 36% of the time it predicts the wrong outcome. This means your model does only 4% better than simply guessing that all cases are y = 1.
• Deviance is a measure of error. Lower deviance is better. When an informative predictor is added to a model, we expect deviance to decrease by more than 1. If not, then we’ve likely added a non-informative predictor to the model that just adds noise.
• If a predictor x is completely aligned with the outcome so that y = 1 when x is above some threshold and y = 0 when x is below some threshold, then the coefficient estimate will explode to some gigantic value. This means the parameter cannot be estimated. This is an identifiability problem called separation.

Most of this comes from Chapter 5 of Data Analysis Using Regression and Multilevel/Hierarchical Models by Gellman and Hill. I also pulled a little from chapter 5 of An Introduction to Categorical Data Analysis by Agresti.