Category Archives: Using R

Partial Residuals and the termplot function

Earlier this year I taught an Intro to R Graphics workshop. While preparing for it I came across the termplot function. I had no idea it existed. Of course now that I know about it, I come across it fairly regularly. (That's the Baader-Meinhof Phenomenon for you.)

?termplot will tell you what it does: “Plots regression terms against their predictors, optionally with standard errors and partial residuals added.” If you read on, you'll see “Nothing sensible happens for interaction terms, and they may cause errors.” Duly noted.

Let's try this out and see what it does.

First we'll fit a model using the mtcars dataset that comes with R. Below mpg = Miles/(US) gallon, drat = Rear axle ratio, qsec = ¼ mile time, disp = Displacement (, and wt = Weight (lb/1000). <- lm(mpg ~ wt + disp + drat + qsec, data=mtcars)
## Call:
## lm(formula = mpg ~ wt + disp + drat + qsec, data = mtcars)
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.143 -1.933 -0.149  0.919  5.541 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.03223    9.58829    0.94  0.35454    
## wt          -4.86768    1.20872   -4.03  0.00041 ***
## disp         0.00522    0.01105    0.47  0.64021    
## drat         1.87086    1.32462    1.41  0.16926    
## qsec         1.05248    0.34778    3.03  0.00539 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 2.6 on 27 degrees of freedom
## Multiple R-squared:  0.838,  Adjusted R-squared:  0.814 
## F-statistic:   35 on 4 and 27 DF,  p-value: 2.55e-10

plot of chunk unnamed-chunk-1

Compare the coefficients in the summary output to the slopes of the lines in the graphs. See how the coefficients match the slopes of the lines? That's because they are the slopes of the lines. For example, wt has a coefficient of about -5. Likewise the termplot for that coefficient has a negative slope that appears to be about -5. Using these plots we can see at a glance the respective contributions of the predictors. wt and qsec seem to contribute to the model due to the steepness of their slopes, but disp and drat not so much. This matches the summary output that shows wt and qsec as being significant.

Now termplot also has an argument called partial.resid that will add partial residuals if set to TRUE. Let's try that out, with color set to “purple” since the default gray can be hard to see (at least to my eyes):

termplot(, partial.resid=TRUE, col.res = "purple")

plot of chunk unnamed-chunk-2

How are these points generated? The x-axis is clear enough. That's just the predictor values. But the y-axis says “Partial for [variable name]”, i.e. partial residuals. What does that mean? Put as simply as possible, partial residuals are the sum of the residuals and predictor terms. Hopefully you know what residuals are. Those are just the difference between the observed and fitted response values. “Predictor terms” is a little trickier.

To understand them, it helps to know that

\[y = \bar{y} + b_{1}(x_{1} – \bar{x}_{1}) + b_{2}(x_{2} – \bar{x}_{2}) + b_{3}(x_{3} – \bar{x}_{3}) + b_{4}(x_{4} – \bar{x}_{4}) + e\]

is another way of writing

\[y = b_{0} + b_{1}x_{1} + b_{2}x_{2}+ b_{3}x_{3}+ b_{4}x_{4} + e\]

So the first equation above has \(\bar{y}\) as the intercept and centered predictors. If we center the predictors and fit a model, we'll see that's precisely what we get:

# center the variables
mtcars <- transform(mtcars, wtC = wt-mean(wt), dispC = disp-mean(disp),
                    dratC=drat-mean(drat), qsecC=qsec-mean(qsec))
# now fit model using centered predictors <- lm(mpg ~ wtC + dispC + dratC + qsecC, data=mtcars)
## (Intercept) 
##       20.09
## [1] 20.09

Now if we take the centered predictor values and multiply them by their respective coefficients, we get what are referred to in R as terms. For example, the terms for the first record in the data set are as follows:

# take all coefficients but the intercept...
##       wtC     dispC     dratC     qsecC 
## -4.867682  0.005222  1.870855  1.052478
# ...and first record of data...$model[1,-1]
##               wtC  dispC  dratC  qsecC
## Mazda RX4 -0.5972 -70.72 0.3034 -1.389
# ...and multiply:
##       Mazda RX4
## wtC      2.9072
## dispC   -0.3693
## dratC    0.5677
## qsecC   -1.4616

So for the first record, the term due to wt is 2.91, the term due to disp is -0.37, and so on. These are the predictor terms I mentioned earlier. These are what we add to the residuals to make partial residuals.

You'll be happy to know we don't have to center predictors and fit a new model to extract terms. R can do that for us on the original model object when you specify type="terms" with the predict function, like so:

# notice we're calling this on the original model object,
pterms <- predict(, type="terms")
pterms[1,] # compare to above; the same
##      wt    disp    drat    qsec 
##  2.9072 -0.3693  0.5677 -1.4616

To calculate the partial residuals ourselves and make our own partial residual plots, we can do this:

# add residuals to each column of pterms (i.e., the terms for each predictor)
partial.residuals <- apply(pterms,2,function(x)x+resid(
# create our own termplot of partial residuals
# the model in includes the response in first column, so we index with i + 1
for(i in 1:4){
  plot($model[,(i+1)],y=partial.residuals[,i], col="purple", ylim=c(-10,10))
  abline(lm(partial.residuals[,i] ~$model[,(i+1)]), col="red")

plot of chunk unnamed-chunk-6

So that's what termplot does for us: it takes the terms for each predictor, adds the residuals to the terms to create partial residuals, and then plots partial residuals versus their respective predictor, (if you specify partial.resid=TRUE). And of course it plots a fitted line, the result of regressing the predictor's partial residuals on itself. Recall ealier I mentioned the slope of the line in the termplot graphs is the coefficient in the summary output. We can indeed verify that:

# coefficient for wt
##     wt 
## -4.868
coef(lm(partial.residuals[,1] ~ mtcars$wt))[2]
## mtcars$wt 
##    -4.868

Ultimately what this whole partial residual/termplot thing is doing is splitting the response value into different parts: an overall mean, a term that is due to wt, a term that is due to disp, a term that is due to drat, and a term that is due to qsec. And you have the residual. So when we create the partial residual for wt, what we're doing is adding the wt term and the residual. This sum accounts for the part of the response not explained by the other terms.

Again let's look at the overall mean, the terms for the first observation and its residual:

# overall mean of response
## [1] 20.09
# terms of first observation
##      wt    disp    drat    qsec 
##  2.9072 -0.3693  0.5677 -1.4616
# the residual of the first observation
## Mazda RX4 
##   -0.7346

If we add those up, we get the observed value

# add overall mean, terms and residual for first observation
mean(mtcars$mpg) + sum(pterms[1,]) + resid([1]
## Mazda RX4 
##        21
# same as observed in data
## [1] 21

So the wt term value of 2.9072 represents the part of the response that is not explained by the other terms.

Finally, we can add another argument to termplot, smooth=panel.smooth that will draw a smooth lowess line through the points. This can help us assess departures from linearity.

termplot(, partial.resid=TRUE, col.res = "purple", smooth=panel.smooth)

plot of chunk unnamed-chunk-10

Notice how the dashed line bends around the straight line for wt. Perhaps wt requires a quadratic term? Let's try that using the poly function, which creates orthogonal polynomials.

lm.cars2 <- lm(mpg ~ poly(wt,2) + disp + drat + qsec, data=mtcars)
termplot(lm.cars2, partial.resid=TRUE, col.res = "purple", smooth=panel.smooth)

plot of chunk unnamed-chunk-11

We now see a better fit with the regressed line matching closely to the smooth line.

Permutation Tests

Let's talk about permutation tests and why we might want to do them.

First think about the two-sample t-test. The null hypothesis of this test is that both samples come from the same distribution. If we assume both samples come from the same approximately normal distribution, we can use math formulas based on probability theory to calculate a test statistic. We can then calculate the probability of getting such a test statistic (or one greater) under the assumption of both samples coming from the same distribution. For example, let's draw two samples from the same normal distribution and run a t-test.

s1 <- round(rnorm(10, 100, 16))
s2 <- round(rnorm(10, 100, 16))
t.test(s1, s2, var.equal = TRUE)
##  Two Sample t-test
## data:  s1 and s2
## t = 0.8193, df = 18, p-value = 0.4233
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -7.508 17.108
## sample estimates:
## mean of x mean of y 
##     102.4      97.6

As expected the p-value is high, telling us the difference in their means is not significant. Another interpretation is that the resulting t-test statistic (0.8193) would be likely if both samples came from the same distribution.

In real life we don't know with certainty the distribution of the population from which our samples are drawn. We also don't know the variance of the distribution or whether the distribution is even symmetric. We also may be stuck with a small sample size. Fortunately the t-test is pretty robust and usually reliable even when its assumptions are wrong. However, if you have your doubts, you can try a permutation test.

In the case our two-sample example above, the permutation test takes all possible combinations of group membership and creates a permutation distribution. In other words, if we assume both samples came from the same population, a data point in group 1 is just as likely to appear in group 2. If we determine all possible permutations, we can calculate our statistic of interest for each permutation and create a distribution. We can then assess where our original statistic falls in this distribution. If it's in the tails then we have evidence that our two samples come from two different populations. Now if your sample sizes are bigger than say, 15, creating the permutations can get computationally expensive very fast. So this is probably not something you want to do unless you're dealing with small samples.

Here's how we can do a permutation test “by hand” in R with our fake data:

d0 <- mean(s1) - mean(s2)
alls <- c(s1, s2)  # combine into one vector
N <- length(alls)
n <- length(s1)
p <- combn(N, n)  # generate all combinations of n chosen from N
dmeans <- numeric(dim(p)[2])  # vector to store means
for (i in 1:dim(p)[2]) {
    dmeans[i] <- mean(alls[p[, i]]) - mean(alls[-p[, i]])
# plot histogram of all possible difference in means with lines indicating
# our original difference
abline(v = d0, lty = 2)
abline(v = -d0, lty = 2)

plot of chunk unnamed-chunk-2

The combn function generates a matrix of all possible index combinations of n chosen from N. I say “index” because combn doesn't return the values themselves but rather their index position in the original vector. Thus we can use it to select all possible combinations from our combined vector, which is what we do in the for loop. In this case there are 184756 possible combinations. Once all possible differences in means are calculated we can plot a histogram of the differences and draw some lines to indicate where our original sample falls. Here I drew two lines to indicate the absolute value of the difference, the equivalent of a two-sided test. The probability (or p-value) of getting a difference more extreme than our original sample is pretty high. We can see this because there's a lot of histogram on either side of our lines. We can also calculate this value:

# two-sided p-value
signif((sum(dmeans <= -abs(d0)) + sum(dmeans >= abs(d0)))/dim(p)[2])
## [1] 0.4291

Very close to the p-value returned by the t-test.

Of course there are packages for this sort of thing. The one I know of is perm. Here's how we use its permTS function to replicate our results above:

permTS(s1, s2, alternative = "two.sided", method = "exact.ce", 
control = permControl(tsmethod = "abs"))
##  Exact Permutation Test (complete enumeration)
## data:  s1 and s2
## p-value = 0.4291
## alternative hypothesis: true mean s1 - mean s2 is  0
## sample estimates:
## mean s1 - mean s2 
##               4.8

The method= argument tells the function to do “complete enumeration”. The control= argument says how to calculate the p-value (i.e., the same way we did it “by hand”).

So that's a literal two-sample permutation test. As I mentioned, if your samples are large then this approach is not feasible as the number of permutations grows out of control. For example, two groups of size 20 results in 137,846,528,820 combinations. So we usually resort to resampling methods. This is where we repeatedly resample from our combined vector of values to get a large number of combinations. We don't generate all combinations, but enough to give us a good idea. Here's one way to do it:

dmeans <- numeric(2000)  # vector to store means
for (i in 1:2000) {
    g <- sample(N, n)
    dmeans[i] <- mean(alls[g]) - mean(alls[-g])
abline(v = d0, lty = 2)
abline(v = -d0, lty = 2)

plot of chunk unnamed-chunk-5

signif((sum(dmeans <= -abs(d0)) + sum(dmeans >= abs(d0)))/2000)
## [1] 0.4335

So out of 1.8476 × 105 possible permutations we only calculated 2000 (assuming no repeats) and we still got a p-value pretty close to that of our literal permutation test. Not bad.

We can do the same using the permTS function as follows:

permTS(s1, s2, alternative = "two.sided", 
method = "", control = permControl(nmc = 2000, 
    setSEED = FALSE, tsmethod = "abs"))
##  Exact Permutation Test Estimated by Monte Carlo
## data:  s1 and s2
## p-value = 0.4213
## alternative hypothesis: true mean s1 - mean s2 is  0
## sample estimates:
## mean s1 - mean s2 
##               4.8 
## p-value estimated from 2000 Monte Carlo replications
## 99 percent confidence interval on p-value:
##  0.3925 0.4498

The method “” says use Monte Carlo simulation. The nmc= argument specifies number of replications (the default is 999). The setSEED= argument says not to set a seed. I want different random replications each time I run this line of code.

To wrap up, when does it make sense to use permutation tests? When you have something to permute! I quote from the classic text, An Introduction to the Bootstrap:

Permutation methods tend to apply to only a narrow range of problems. However when they apply, as in testing F = G in a two-sample problem, they give gratifyingly exact answers without parametric assumptions.

Simulating responses from a linear model

Say you fit a model using R's lm() function. What you get in return are coefficients that represent an estimate of the linear function that gave rise to the data. The assumption is the response of the model is normally distributed with a mean equal to the linear function and a standard deviation equal to the standard deviation of the residuals. Using notation we express this as follows for a simple linear model with intercept and slope coefficients:

\[ Y_{i} \sim N(\beta_{0} + \beta_{1}x_{i},\sigma) \]

The implication of this in the world of statistical computing is that we can simulate responses given values of x. R makes this wonderfully easy with its simulate() function. Simply pass it the model object and the number of simulations you want and it returns a matrix of simulated responses. A quick example:

# generate data from the linear function y = 3 + 4.2x with random noise
x <- seq(12, 23, length = 100)
y <- 3 + 4.2 * x + rnorm(100, 0, 5)
mod <- lm(y ~ x)
## Call:
## lm(formula = y ~ x)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.700  -3.029   0.078   2.926  11.487 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.900      2.504    1.56     0.12    
## x              4.180      0.141   29.70   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 4.51 on 98 degrees of freedom
## Multiple R-squared:   0.9,   Adjusted R-squared:  0.899 
## F-statistic:  882 on 1 and 98 DF,  p-value: <2e-16

Regressing y on x estimates the coefficients to be about 3.9 and 4.18, pretty close to the true values of 3 and 4.2. It also estimates the standard deviation of the error term to be 4.51, not too far from the 5 we specified in rnorm() when we generated the noise. Now let's use our model to simulate responses (i.e., y values).

head(simulate(mod, 5))
##   sim_1 sim_2 sim_3 sim_4 sim_5
## 1 51.26 55.90 58.09 58.91 54.40
## 2 54.71 62.14 49.79 63.08 53.18
## 3 50.87 62.15 63.88 52.26 49.64
## 4 56.16 53.96 53.72 53.69 55.50
## 5 52.96 45.60 63.38 54.04 60.39
## 6 64.35 67.65 63.20 54.68 63.57

The simulate() function returns a data frame. In this case the data frame has 100 rows and 5 columns. (I use the head() function to only show the first six rows.) Each column represents a simulated set of responses from our linear model given our x values. In the first row we have 5 values drawn from a Normal distribution with mean \( 3.89 + 4.17x_{1} \) and a standard deviation of 4.51. In the second row we have a value drawn from a Normal distribution with mean \( 3.89 + 4.17x_{2} \) and a standard deviation of 4.51. And so on. We could do this “manually” as follows:

simResp <- matrix(0, 100, 5)
for (i in 1:5) {
    simResp[, i] <- rnorm(5, c(coef(mod)[1] + coef(mod)[2] * x), rep(summary(mod)$sigma, 
simResp[1:6, ]
##       [,1]  [,2]  [,3]  [,4]  [,5]
## [1,] 52.52 48.93 54.80 52.14 52.90
## [2,] 61.30 47.77 56.02 53.17 46.46
## [3,] 57.37 53.98 53.25 46.90 63.04
## [4,] 57.90 64.48 49.14 54.33 63.41
## [5,] 55.30 56.91 67.99 54.80 59.03
## [6,] 52.52 48.93 54.80 52.14 52.90

But the simulate() function is much faster and easier to use.

Now 5 simulations is rather puny. A better number would be 1000. Then we can use our simulated y values to generate 1000 linear models, and thus have 1000 coefficient estimates. We can then use those estimates to create kernel density plots that allow us to visualize the amount of uncertainty in our coefficient estimates. Let's do that. Notice how I convert the output of the simulate() function to a matrix. That allows me to feed the 1000 responses to the lm() function in one fell swoop instead of coding a for loop to iterate through each column.

simResp <- data.matrix(simulate(mod, 1000))
modSim <- lm(simResp ~ x)

modSim is the usual linear model object, but with the results of 1000 models instead of 1. Here are the coefficients for the first 5 models:

coef(modSim)[1:2, 1:5]
##               sim_1 sim_2 sim_3 sim_4 sim_5
## (Intercept) -0.5932 3.949 3.861 5.683 5.038
## x            4.3836 4.166 4.199 4.062 4.113

Let's create a scatterplot of these coefficients.

plot(modSim$coefficients[1, ], modSim$coefficients[2, ], xlab = "intercept", 
    ylab = "slope", main = "Scatterplot of coefficients from simulated models")

plot of chunk unnamed-chunk-6

We see that the variability of the slope estimates is pretty small (3.8 – 4.6), while the intercept estimates vary widely (from below 0 to 10). This jives with our original model (see output above) where the slope coefficient was signficant with a small standard error while the intercept was not significant and had a relatively large standard error.

Next we create kernel density plots of our coefficient estimates. In addition I add a vertical line to represent the mean of the 1000 coefficient estimates and add the mean itself to the plot using the text() function. I use the difference in the range of the axes to allow me to automatically determine suitable coordinates.

d1 <- density(modSim$coefficients[2, ])
d2 <- density(modSim$coefficients[1, ])
# slope
plot(d1, main = "distribution of slope coefficients")
abline(v = mean(modSim$coefficients[2, ]))
text(x = mean(modSim$coefficients[2, ]) + diff(range(d1$x)) * 0.05, y = diff(range(d1$y)) * 
    0.05, labels = round(mean(modSim$coefficients[2, ]), 2))

plot of chunk unnamed-chunk-7

# intercept
plot(d2, main = "distribution of intercept coefficients")
abline(v = mean(modSim$coefficients[1, ]))
text(x = mean(modSim$coefficients[1, ]) + diff(range(d2$x)) * 0.05, y = diff(range(d2$y)) * 
    0.05, labels = round(mean(modSim$coefficients[1, ]), 2))

plot of chunk unnamed-chunk-7

If you don't compare the scale of the x-axis in the two plots the coefficients appear to have similar variability. Perhaps a better way to do it is to put both plots in one window and set the limits to the x-axis for both plots to be the range of the estimated intercept coefficients.

par(mfrow = c(2, 1))
# slope
plot(d1, main = "distribution of slope coefficients", xlim = range(d2$x))
abline(v = mean(modSim$coefficients[2, ]))
text(x = mean(modSim$coefficients[2, ]) + diff(range(d2$x)) * 0.05, y = diff(range(d1$y)) * 
    0.05, labels = round(mean(modSim$coefficients[2, ]), 2))
# intercept
plot(d2, main = "distribution of intercept coefficients", xlim = range(d2$x))
abline(v = mean(modSim$coefficients[1, ]))
text(x = mean(modSim$coefficients[1, ]) + diff(range(d2$x)) * 0.05, y = diff(range(d2$y)) * 
    0.05, labels = round(mean(modSim$coefficients[1, ]), 2))

plot of chunk unnamed-chunk-8

par(mfrow = c(1, 1))

Now we see just how uncertain our intercept estimate is compared to the slope estimate.

Finally, why don't we go ahead and plot all 1000 fitted lines over the original data along with our original fitted line and the traditional 95% confidence band we get using our original model.

plot(x, y, col = "red", pch = 19)
abline(mod)  # original fitted line
apply(coef(modSim), 2, abline, col = "gray", lty = 3)
# predicted responses from original model
modOrig <- predict(mod, interval = "confidence")
# add 95% confidence band
lines(x, modOrig[, "lwr"], col = "blue")
lines(x, modOrig[, "upr"], col = "blue")

plot of chunk unnamed-chunk-9

The plotting of the 1000 lines essentially provides us a confidence band for the original fitted line. We see the blue lines are slightly closer to the fitted line than the band produced by our simulated models. That's because the blue lines represent the 95% confidence band. In other words, we would expect 95% of our 1000 estimated lines to fit within those blue lines. The gray lines we see outside of the blue lines are those 5% that don't fall within the 95% confidence band.

Getting started with GitHub Gist in WordPress

I’ve decided it’s time to use GitHub Gist for posting code snippets in my blog. My original motivation was better syntax highlighting. I was using WP Code Highlight, which isn’t bad. It’s easy to use and does what it promises. But it doesn’t add line numbers and the syntax highlighting doesn’t seem optimal for R. GitHub Gist on the other hand handles R code beautifully.

Now highlighted code is nice, but what’s really cool about Gist is that the code is actually stored in a repository. This means I can update the code without having to update my WordPress post. And it’s easier for other people to use should they ever want to use my code. And there are other benefits as well. I guess the only bad thing is that GitHub has my code, which would be bad if they went belly-up. I’m pretty sure that won’t happen though, because I don’t want to think about it.

So here’s how you use GitHub Gist with WordPress to host R code. First create an account on GitHub and go to GitHub Gist. Where it says “Name of file”, type the name of your file and add “.r” to the end. Next select R from the language pulldown, though Gist will probably select it for you if you typed a file name ending in “.r”. Now copy and paste your code into the editor and click “create public gist”. You’re done with GitbHub

To make this block of code appear in your WordPress post, copy the code in the “Embed this gist” field and paste into your post. (Make sure you’re in the Text editor instead of the Visual editor). It would be really nice if that was the end of it, but it’s not. WordPress strips out the embed code and nothing happens. The workaround is to add a little plug-in called Raw HTML. As the description says, “This plugin adds a set of shortcodes that you can use to “protect” specific parts of your post and prevent WP from messing with them.” Awesome, just what we need. So the final step is to surround the Gist embed code with [raw]…[/raw] shortcode. (A tip of my hat to this site for telling me about Raw HTML.

Let’s try this out. Here’s some code that samples 30 values from a N(15,2) distribution, calculates a 95% confidence interval, and plots the interval as a line. It does this 40 times and colors the CI’s that don’t capture 15 red.

To make that appear in my post, I added the following:

<script src=””></script>

Exploring Unordered Contrasts in R

Contrasts in R determine how linear model coefficients of categorical variables are interpreted. The default contrast for unordered categorical variables is the Treatment contrast. This means the “first” level (aka, the baseline) is rolled into the intercept and all subsequent levels have a coefficient that represents their difference from the baseline. That’s not too hard to grasp. But what about other contrasts, namely the Helmert and Sum contrasts? What do they do? Instead of explaining them, I figured I would demonstrate each.

First, let’s make some data:

# create 3 levels, 10 each
flevels <- factor(rep(c("A","B","C"),c(10,10,10))) 
# create some "nice" data, sorted so means at each level have good separation
vals <- sort(round(runif(30,3,15))) 
# calculate mean of each level for reference
means <- tapply(vals,flevels,mean) 
   A    B    C 
 6.0 10.1 12.9

Our data consist of three levels of arbitrary values. "flevels" is our categorical variable. Notice I explicitly defined it to be a factor using the factor() function. I need to do this so R knows this variable is a factor and codes it according to whatever contrast setting we decide to use.

Let's verify the default unordered contrast setting is the Treatment contrast:

        unordered           ordered 
"contr.treatment"      "contr.poly"

Indeed it is. This means our factor levels are coded as follows:

  B C
A 0 0
B 1 0
C 0 1

This is a 3 x 2 matrix. The 2 columns of the matrix tells us that our model will have 2 coefficients, one for the B level and one for the C level. Therefore the A level is the baseline. The coefficients we get in our linear model for B and C will indicate the respective differences of their mean from the level A mean. The values in the rows tell us what values to plug into the model to get the means for the row labels. For example, to get the mean for A we plug in 0's for both coefficients which leaves us with the intercept. Therefore the intercept is the mean of A. Let's see all this in action before we explore the Helmert and Sum contrasts.

m.trt <- lm(vals ~ flevels)

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   6.0000     0.3935  15.249 8.63e-15 ***
flevelsB      4.1000     0.5564   7.368 6.32e-08 ***
flevelsC      6.9000     0.5564  12.400 1.17e-12 ***

Now we can verify how the Treatment contrast works by extracting the coefficient values from the model and comparing to the means we calculated earlier:

# Intercept = mean of A
# flevelsB = mean of B - mean of A
means[2] - means[1]
# flevelsC = mean of C - mean of A
means[3] - means[1]

Let's also verify that plugging in the row values of the contrast matrix returns the means of each level:

# plug in row values into model to get the means of A, B and C, respectively
   A    B    C 
 6.0 10.1 12.9 
# mean of A --> row 1: 0 0
coef(m.trt)[1] + 0*coef(m.trt)[2] + 0*coef(m.trt)[3]
# mean of B --> row 2: 0 0
coef(m.trt)[1] + 1*coef(m.trt)[2] + 0*coef(m.trt)[3]
# mean of C --> row 3: 0 0
coef(m.trt)[1] + 0*coef(m.trt)[2] + 1*coef(m.trt)[3]

So that's how Treatment contrasts work. Now let's look at Helmert contrasts. "The coefficients for the Helmert regressors compare each level with the average of the "preceding" ones", says Fox in his book An R and S-Plus Companion to Applied Regression. I guess that makes sense. Kind of. Eh, not really. At least not to me. I say we do as we did before: fit a model and compare the coefficients to the means and see what they mean. Before we do that we need to set the contrast to Helmert:

# set contrast to "contr.helmert"
contrasts(flevels) <- "contr.helmert"
contrasts(flevels) # take a look
  [,1] [,2]
A   -1   -1
B    1   -1
C    0    2

Interesting. Notice the column labels are no longer associated with the levels of the factor. They just say 1 and 2. However this still tells us that our model will have two coefficients. Again the row values tell us what to plug in to get the means of A, B and C, respectively. To get the mean of A, we plug in -1 and -1 to the model. This means our intercept has a different interpretation. Let's fit the linear model and investigate.

m.hel <- lm(vals ~ flevels)

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.6667     0.2272  42.553  < 2e-16 ***
flevels1      2.0500     0.2782   7.368 6.32e-08 ***
flevels2      1.6167     0.1606  10.064 1.24e-10 ***

It turns out the intercept is the mean of the means, the first coefficient is the mean of the first two levels minus the first level, and the second coefficient is the mean of all three levels minus the mean of the first two levels. Did you get that? Here, this may help:

# intercept = mean of all means
[1] 9.666667
# flevels1 = mean of first two levels minus first level
mean(means[1:2]) - means[1]
# flevels2 = mean of all three levels minus mean of first two levels
mean(means) - mean(means[1:2])
[1] 1.616667

Let's do that thing again where we plug in the row values of the contrast matrix to verify it returns the means of the levels:

   A    B    C 
 6.0 10.1 12.9 
# mean of A --> row 1: -1 -1
coef(m.hel)[1] + -1*coef(m.hel)[2] + -1*coef(m.hel)[3]
# mean of B --> row 2: 1 -1
coef(m.hel)[1] + 1*coef(m.hel)[2] + -1*coef(m.hel)[3]
# mean of C --> row 3: 0 2
coef(m.hel)[1] + 0*coef(m.hel)[2] + 2*coef(m.hel)[3]

That leaves us with the Sum contrast. Regarding models fitted with the Sum contrasts, Fox tells us that "each coefficient compares the corresponding level of the factor to the average of the other levels." I think like Helmert contrasts, this one is better demonstrated. As before we need to change the contrast setting.

# set contrast to "contr.sum"
contrasts(flevels) <- "contr.sum"
contrasts(flevels) # take a look
  [,1] [,2]
A    1    0
B    0    1
C   -1   -1

Just like the Helmert contrast we see two columns with no labels. Our model will have two coefficients that don't correspond directly to the levels of our factors. By now we know the values in the rows are what we plug into our model to get the means of our levels. To get the mean of level A, we plug in 1 and 0. Time to fit the model and investigate:

m.sum <- lm(vals ~ flevels)
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.6667     0.2272  42.553  < 2e-16 ***
flevels1     -3.6667     0.3213 -11.413 7.75e-12 ***
flevels2      0.4333     0.3213   1.349    0.189  

Like the Helmert contrasts, our intercept is mean of all means. But our two coefficients have different interpretations. The first is the mean of all means minus the mean of level 1 (A). The second coefficient is the mean of all means minus the mean of level 2 (B). Notice in the model output above that the second coefficient is not significant. In other words, the mean of level B is not significantly different from the mean of all means.

# intercept = mean of all means
[1] 9.666667
# flevels1 = mean of all means - mean of level 1 (here, A)
means[1] - mean(means)
# flevels2 = mean of all means - mean of level 2 (here, B)
means[2] - mean(means)

Finally to be complete we plug in the row values of the Sum contrast matrix to verify it returns the means of the factor levels:

   A    B    C 
 6.0 10.1 12.9 
# mean of A -->row 1: 1 0
coef(m.sum)[1] + 1*coef(m.sum)[2] + 0*coef(m.sum)[3]
# mean of B -->row 2: 0 1
coef(m.sum)[1] + 0*coef(m.sum)[2] + 1*coef(m.sum)[3]
# mean of C -->row 3: -1 -1
coef(m.sum)[1] + -1*coef(m.sum)[2] + -1*coef(m.sum)[3]

And finally we wrap up this exercise by returning the contrast level of our categorical variable back to the system default:

contrasts(flevels) <- NULL

Hopefully this helps you get a better handle on what Helmert and Sum contrasts do.

Modeling Discontinuous Change (Ch 6 of ALDA)

Chapter 6 of ALDA introduces strategies for fitting models in which individual change is discontinuous. This means the linear trajectory has a shift in the elevation and/or slope. To fit a model with discontinuous change we need to “include one (or more) time-varying predictor(s) that specify whether and, if so, when each person experiences the hypothesized shift.” (p. 191)

Some of the forms discontinuous change can take include:

  • an immediate shift in elevation, but no shift in slope
  • an immediate shift in slope, but no shift in elevation
  • immediate shifts in both elevation and slope
  • shifts in elevation (or slope) that differ in magnitude by time
  • multiple shifts in elevation (or slope) during multiple epochs of time

The example in the book replicates work done by Murnane, et al. (1999). In their paper they analyzed wage data for high school dropouts and investigated whether (log) wage trajectories remained smooth functions of work experience. Their idea was that obtaining a GED might command a higher wage and thus cause a discontinuity in the linear model fit to the data. The authors partially replicate this work by fitting a taxonomy of multilevel models.

Here’s the data (courtesy of UCLA IDRE). The variables of interest are:

  • id – person ID
  • lnw – natural log of wages (the response)
  • ged – indicator (1 = attained GED; 0 otherwise)
  • exper – years in labor force to nearest day
  • postexp – years in labor force from day of GED attainment
  • hgc.9 – highest grade completed, centered on grade 9
  • black – indicator (1 = black; 0 otherwise)
  • ue.7 – unemployment rate, centered on 7%

The initial model is rather elaborate due to earlier work in the book. Using the book’s notation we state the level-1 and level-2 models as follows:

Y_{ij} = \pi_{0i} + \pi_{1i}EXPER_{ij} + \pi_{2i}(UE_{ij} - 7) + \epsilon_{ij}
\pi_{0i} = \gamma_{00} + \gamma_{01}(HGC_{i}-9) + \xi_{0i}
\pi_{1i} = \gamma_{10} + \gamma_{12}BLACK_{i}  + \xi_{1i}
\pi_{2i} = \gamma_{20}
\epsilon_{ij} \sim N(0,\sigma_{\epsilon}^{2}
\begin{bmatrix} \xi_{0i}\\ \xi_{1i} \end{bmatrix} \sim N \begin{pmatrix} \begin{bmatrix} 0\\ 0 \end{bmatrix}, \begin{bmatrix} \sigma_{0}^{2} \sigma_{01}\\ \sigma_{10} \sigma_{1}^{2} \end{bmatrix} \end{pmatrix}

What a mess. Let’s break this down. The level-1 model is the individual growth model. It posits an individual’s wages can be explained by his years in the labor force (EXPER) and the unemployment rate (UE). The level-2 model says

  • the intercept in the level-1 model varies by highest grade completed (HGC)
  • the EXPER coefficient in the level-1 model varies based on whether or not your race is black (BLACK)
  • the UE coefficient is fixed for all individuals

If we collapse the two levels into one model we get

Y_{ij} = \gamma_{00} + \gamma_{01}(HGC_{i}-9) + \gamma_{10}EXPER_{ij} + \gamma_{12}EXPER_{ij} \times BLACK_{i} + \gamma_{20}(UE_{ij}-7) + \xi_{0i} + \xi_{1i}EXPER_{ij} + \epsilon_{ij}

That’s not exactly fun to look at either, but the last few terms reveal the random effects. The \xi_{0i} is the random effect for the intercept, \xi_{1i} is the random effect for the EXPER slope parameter and \epsilon_{ij} is the residual error.

We can fit this model in R as follows:

wages <- read.table("", 
          header=T, sep=",")
model.a <- lmer(lnw ~ exper + hgc.9 + exper:black + ue.7 + (exper|id), 
          wages, REML=FALSE)

The key part is the stuff in the parentheses. It says EXPER - and the intercept by default - are the random effects, and that they're grouped by ID (ie, the individuals). This means that each individual has his own intercept and EXPER coefficient in the fitted model. Let's look at the model's fixed effects and the random effects for individual 1.

The model's fixed effects:

(Intercept)       exper       hgc.9        ue.7 exper:black 
 1.74898840  0.04405394  0.04001100 -0.01195050 -0.01818322 

Random effects for first individual (ID = 31) in data:

   (Intercept)      exper
31  -0.1833142 0.02014632

To see the final model for this individual we add his random effects to the fixed effects:

ranef(model.a)$id[1,] + fixef(model.a)
   (Intercept)      exper
31    1.565674 0.06420026

Or we can just do this:

   (Intercept)      exper    hgc.9       ue.7 exper:black
31    1.565674 0.06420026 0.040011 -0.0119505 -0.01818322

Notice how the intercept and EXPER coefficient are different for the individual versus the fixed effects. Now we're usually less interested in the specific random effects (in this case there are 888 of them!) and more interested in their variances (or variance components). The variance component for the intercept is 0.051. The variance component for EXPER is 0.001. Those are pretty small but not negligible.

Having said all that, the goal of this exercise is to build the best model with discontinuities, which is largely done by deviance statistics. So let's work through the book's example and remember that everything I explained above is the baseline model. All subsequent models will build upon it.

The first model up adds a discontinuity in the intercept by including fixed and random effect for GED:

model.b <- lmer(lnw ~ exper + hgc.9 + exper:black + ue.7 + ged + 
                (exper + ged|id), wages, REML=FALSE)

To get a feel how GED affects the model, look at the records for individual 53:

wages[wages$id == 53,1:4]
   id   lnw exper ged
19 53 1.763 0.781   0
20 53 1.538 0.943   0
21 53 3.242 0.957   1
22 53 1.596 1.037   1
23 53 1.566 1.057   1
24 53 1.882 1.110   1
25 53 1.890 1.185   1
26 53 1.660 1.777   1

Notice how GED flips from 0 to 1 over time. Model B allows an individual's wage trajectory to shift in "elevation" at the point GED changes to 1. Hence the discontinuity. Should we allow this? Calling anova(model.a,model.b) helps us decide. In the output you'll see the p-value is less than 0.001. The null here is no difference between the models, i.e., the new explanatory variable in model B (GED) has no effect. So we reject the null and determine that an individual's wage trajectory may indeed display a discontinuity in elevation upon receipt of a GED.

Our next model is model B without GED random effects:

model.c <- lmer(lnw ~ exper + hgc.9 + exper:black + ue.7 + 
                ged + (exper|id), wages, REML=FALSE)

This is our baseline model with an additional fixed effect for GED. Should we include random effects for GED? Again we test the null that there is no difference between model B and C by calling anova(model.c,model.b). Since model C is nested in model B and the p-value returned is about 0.005, we reject the null and decide to keep the GED random components.

The next two models explore a discontinuity in the slope of the wages trajectory but not the elevation. We do this by removing the GED fixed and random effects and replace them with POSTEXP fixed and random effects. Recall that POSTEXP is years in labor force from day of GED attainment. To see how it works, look at the records for individual 4384:

wages[wages$id == 4384,3:5]
     exper ged postexp
2206 0.096   0   0.000
2207 1.039   0   0.000
2208 1.726   1   0.000
2209 3.128   1   1.402
2210 4.282   1   2.556
2211 5.724   1   3.998
2212 6.024   1   4.298

The record where GED flips to 1 is the day he obtains his GED. The next record clocks the elapsed time in the workforce since obtaining his GED: 1.402 years. POSTEXP records this explicitly. But that's what EXPER records as well: 3.128 - 1.726 = 1.402. So these two variables record the passage of time in lockstep. It's just that EXPER records from the "beginning" and POSTEXP records from the day of GED attainment. Allowing this additional variable into the model allows the slope of the wages trajectory to suddenly change when a GED is obtained. OK, enough talking. Let's see if we need it.

model.d <- lmer(lnw ~ exper + hgc.9 + exper:black + ue.7 + postexp + 
                (exper + postexp|id), wages, REML=FALSE)

The p-value returned from the anova function is about 0.01. This says model D is an improvement over model A and that the trajectory slope of wages indeed changes upon receipt of GED.

The next model is model D without random effects for POSTEXP. So whereas previously we allowed the change in slope to vary across individuals (random), now we're saying the change in slope is uniform (fixed) for all individuals. In the R code this means removing POSTEXP from the random effects portion but keeping it as a fixed effect.

model.e <- lmer(lnw ~ exper + hgc.9 + exper:black + ue.7 + postexp + 
               (exper|id), wages, REML=FALSE)

The results from this anova test conclude no difference between the models (p-value = 0.34). This means we may not need to allow for POSTEXP random effects.

But before we go with that, let's fit a model that allows for discontinuity in both the slope and elevation of the individuals' wages trajectory. In other words, let's throw both GED and POSTEXP in the model as both fixed and random effects. And then let's compare the model with previous models to determine whether or not to keep each predictor.

model.f <- lmer(lnw ~ exper + hgc.9 + exper:black + ue.7 + 
                postexp + ged + (postexp + ged + exper|id), 
                wages, REML=FALSE)

# compare models f and b to evaluate POSTEXP effect (NULL: no POSTEXP effect)

# compare models f and d to evaluate GED effect  (NULL: no GED effect)

In both anova tests, the p-values are very small, thus we reject the null in each case and retain the predictors. Therefore it appears that in the presence of the GED predictor that we do actually want to retain random effects for POSTEXP. But we're not done yet. Let's fit two more models each without the POSTEXP and GED random effects, respectively, and compare them to model F:

# MODEL G - Model F without POSTEXP variance component
model.g <- lmer(lnw ~ exper + hgc.9 + exper:black + ue.7 + 
                postexp + ged + (ged + exper|id), wages, REML=FALSE)

# MODEL H - Model F without GED variance component
model.h <- lmer(lnw ~ exper + hgc.9 + exper:black + ue.7 + 
                postexp + ged + (postexp + exper|id), wages, REML=FALSE)

# compare models g and h to model f to see if we should 
# keep POSTEXP and GED variance components
# NULL: do not need variance components

In both anova tests we get p-values less than 0.01 and reject the null. We should indeed keep the random effects. So our final model allows for both discontinuities in elevation and slope in the individuals' trajectories. Here's a super rough way we can visualize this in R:

# visual aid of discontinuity in slope and elevation
# made up model coefficients
b0 <- 3
b1 <- 5 
b2 <- 2
b3 <- 4
# variables
ind <- c(rep(0,10),rep(1,11)) # indicator of event
time <- c(1:10,10,11:20) # time
time2 <- c(rep(0,11),1:10)  # additional time tracking after event occurs
# create response
y <- b0 + b1*ind + b2*time + b3*time2
# plot response versus time
plot(time, y, type="l")

This gives us the following line plot:

Notice the shift in elevation at 10 and then the change in slope after the shift in elevation.

Treating Time More Flexibly (Ch 5 of ALDA)

Through 4 chapters of Applied Longitudinal Data Analysis (ALDA), the data sets have had the following constraints:

  • Balanced – all subjects have the same number of measurements.
  • Time structured – all subjects measured at the same time.
  • Time-invariant predictors – predictors that do not change over time, such as gender or treatment group.

In chapter 5 these constraints are relaxed. We work with unbalanced datasets with variably-spaced measurements and time-varying predictors. As usual, the UCLA stats consulting site replicates the chapter’s examples in 18 different stats programs. I won’t redo their work, but I will give you my boiled-down-most-important-points that I took away from this chapter. I’ll also show a couple of examples using the lmer() function from the lme4 package.

Section 5.1 Variably Spaced Measurement Occasions
Analyzing data sets with variably spaced measurement occasions is no different than analyzing data sets with identical occasions across individuals (time structured).

Example with unstructured data set (variably spaced measurements)
Data: reading scores recorded at three different times (i.e., 3 waves of data)
Fit two unconditional growth models

reading <- read.table("
                      alda/data/reading_pp.txt", header=T, sep=",")
mat2 <- reading[ ,3:4]-6.5
dimnames(mat2)[[2]] <- c("agegrp.c", "age.c")  
reading <- cbind(reading, mat2)

# forcing structure on data
lmer.agegrp <- lmer(piat ~ agegrp.c + (agegrp.c | id), reading, REML = FALSE)
# using unstructured data
lmer.age <- lmer(piat ~ age.c + (age.c | id), reading, REML = FALSE)

The first model treats the data as structured. Instead of using child’s precise age, we are using their age classification group (6.5, 8.5, 10.5). The second model uses the child’s precise age. Notice the second model’s lower deviance: 1803 versus 1820. “Treating the unstructured data as though it is time-structured introduces error in the analysis – error that we can reduce by using the child’s age at testing as the temporal predictor.” (p. 145)

Lesson: never force an unstructured data set to be structured.

Section 5.2 Varying Numbers of Measurement Occasions
Section 5.1 concerned varying spacing of measurements. This section concerns varying number of measurements . AKA Unbalanced data. Multilevel modeling allows analysis of data sets with varying numbers of waves of data.

All subjects can contribute to a multilevel model regardless of how many waves of data they contribute. No special procedures are needed to fit a multilevel model to unbalanced data, provided it’s not too unbalanced (i.e., too many people with too few waves with respect to the complexity of your specified model).

Potential Problems with unbalanced data

  • The iterative estimation algorithms may not converge. This affects variance components, not fixed effects. “Estimation of variance components requires that enough people have sufficient data to allow quantification of within-person residual variation.” (p. 152)
  • Exceeding boundary constraints, such as negative variance components. Your output may have an estimate of 0 to indicate this. Simplifying your model by removing random effects is usually the fix.
  • Nonconvergence. This can result from poorly specified models and insufficient data. Can also result from the outcome variable’s scale (too small, make larger) or the temporal predictor’s variable scale (too brief, make longer)

5.3 Time-Varying Predictors

A time-varying predictor is a variable whose values may differ over time. Examples: hours worked per week, money earned per year, employment status. No special strategies are needed to include a time-varying predictor in a multilevel model.

Examples with time-varying predictor
Data: depression scores (cesd) for unemployed; status of employment (unemp; 0 or 1) changes over time

unemployment <- read.table("
                           alda/data/unemployment_pp.txt", header=T, sep=",")

# time-varying predictor is unemp
lmer.unb <- lmer(cesd ~ months + unemp + (months | id), 
                 unemployment, REML = FALSE)

# allow effect of time-varying predictor (unemp) to vary over time
lmer.unc <- lmer(cesd ~ months + unemp*months + (months | id), 
                 unemployment, REML = FALSE)

# constant slope for unemp=0, changing slope for unemp=1
lmer.und <- lmer(cesd ~ unemp + unemp:months + (unemp + unemp:months | id), 
                 unemployment, REML = FALSE)

5.3 Recentering the Effect of Time

Recentering time can produce interpretive advantages such as an intercept that represents initial status. Time can also be recentered in such a way to produce an intercept that represents final status. This is useful when final status is of special concern. Changes in recentering produce different intercept parameters but leave slope and deviance statistics unchanged. It can also lead to an intercept being significant when it previously was not (and vice versa).

A Probability Problem in Heredity – Part 3

In my previous two posts I showed worked solutions to problems 2.5 and 11.7 in Bulmer’s Principles of Statistics, both of which involve the characteristics of self-fertilizing hybrid sweet peas. It turns out that problem 11.8 also involves this same topic, so why not work it as well for completeness. The problem asks us to assume that we were unable to find an explicit solution for the maximum likelihood equation in problem 11.7 and to solve it by using the following iterative method:

\( \theta_{1} = \theta_{0} + \frac{S(\theta_{0})}{I(\theta_{0})} \)

where \( S(\theta_{0}) \) is the value of \( \frac{d \log L}{d\theta}\) evaluated at \( \theta_{0}\) and \( I(\theta_{0})\) is the value of \( -E(\frac{d^{2}\log L}{d\theta^{2}})\) evaluated at \( \theta_{0}\).

So we begin with \( \theta_{0}\) and the iterative method returns \( \theta_{1}\). Now we run the iterative method again starting with \( \theta_{1}\) and get \( \theta_{2}\):

\( \theta_{2} = \theta_{1} + \frac{S(\theta_{1})}{I(\theta_{1})} \)

We repeat this process until we converge upon a value. This is called the Newton-Raphson method. Naturally this is something we would like to have the computer do for us.

First, recall our formulas from problem 11.7:

\( \frac{d \log L}{d\theta} = \frac{1528}{2 + \theta} – \frac{223}{1 – \theta} + \frac{381}{\theta} \)
\( \frac{d^{2}\log L}{d \theta^{2}} = -\frac{1528}{(2 + \theta)^{2}} -\frac{223}{(1 – \theta)^{2}} -\frac{381}{\theta^{2}} \)

Let’s write functions for those in R:

# maximum likelihood score
mls <- function(x) {
	1528/(2 + x) - 223/(1 - x) + 381/x
# the information
inf <- function(x) {
	-1528/((2 + x)^2) - 223/((1 - x)^2) - 381/(x^2)

Now we can use those functions in another function that will run the iterative method starting at a trial value:

# newton-raphson using expected information matrix
nr <- function(th) {
 prev <- th
 repeat {
   new <- prev + mls(prev)/-inf(prev)
   if(abs(prev - new)/abs(new) <0.0001)
   prev <- new

This function first takes its argument and names it "prev". Then it starts a repeating loop. The first thing the loop does it calculate the new value using the iterative formula. It then checks to see if the difference between the new and previous value - divided by the new value - is less than 0.0001. If it is, the loop breaks and the "new" value is printed to the console. If not, the loop repeats. Notice that each iteration is hopefully converging on a value. As it converges, the difference between the "prev" and "new" value will get smaller and smaller. So small that dividing the difference by the "new" value (or "prev" value for that matter) will begin to approach 0.

To run this function, we simply call it from the console. Let's start with a value of \( \theta_{0} = \frac{1}{4}\), as the problem suggests:

[1] 0.7844304

There you go! We could make the function tell us a little more by outputting the iterative values and number of iterations. Here's a super quick and dirty way to do that:

# newton-raphson using expected information matrix
nr <- function(th) {
 k <- 1 # number of iterations
 v <- c() # iterative values
  prev <- th
  repeat {
    new <- prev + mls(prev)/-inf(prev)
    v[k] <- new
    if(abs(prev - new)/abs(new) <0.0001)
    prev <- new
    k <- k + 1
print(new) # the value we converged on
print(v) # the iterative values
print(k) # number of iterations

Now when we run the function we get this:

[1] 0.7844304
[1] 0.5304977 0.8557780 0.8062570 0.7863259 0.7844441 0.7844304
[1] 6

We see it took 6 iterations to converge. And with that I think I've had my fill of heredity problems for a while.

Testing Composite Hypotheses about Fixed Effects (Ch 4 of ALDA)

I’m still on my ALDA kick here, this time posting about section 4.7 of Chapter 4. In my last post I talked deviance-based hypothesis tests in the context of model building. Recall you have to be aware of which method of estimation you used when you deploy the deviance statistic. Namely, if you used Restricted Maximum Likelihood (RML) to estimate model parameters, you can only use the deviance statistic to test hypotheses about variance components. This is an important point as many programs default to RML, such as the lme4 package in R and SAS PROC MIXED. But the deviance statistic is not the only tool for testing composite hypotheses.

That leads us to section 4.7 and the Wald statistic. The Wald statistic allows you to “test composite hypotheses about multiple effects regardless of the method of estimation used. This means that if you use restricted methods of estimation, which prevent you from using deviance-based tests to compare models with different fixed effects, you still have a means of testing composite hypotheses about sets of fixed effects.” (p. 122) Sounds good to me!

The authors give two examples, one of which I want to review in this post. As usual they don’t show you how to do the test using statistical software. Unfortunately either does the UCLA stats consulting page for ALDA. So I had to figure it out on my own.

Let’s reset the example study motivating the work in this chapter. 82 adolescents were surveyed on alcohol use. Some of the variables collected included:

  • alcuse, a rating-scale measure of alcohol use (the response)
  • age_14, age of participant centered about 14 (14 = 0, 15 = 1, 16 = 2)
  • coa, an indicator whether or not participant is a child of an alcoholic (1 = yes, 0 = no)
  • cpeer, a rating-scale measure of alcohol use among peers centered on its sample mean of 1.018
  • id, an arbitrary level to group persons

These variables are part of Model F, the model of interest in section 4.7, which aims to explain the variability in alcuse. (Models A through E precede Model F in the earlier model-building portion of the chapter.) Here’s Model F in its multilevel form:

level 1
Y_{ij} = \pi_{0i} + \pi_{1i}*age14_{ij} + \epsilon_{ij}

level 2
\pi_{0i} = \gamma_{00} + \gamma_{01}*coa + \gamma_{02}*cpeer + \zeta_{0i}
\pi_{1i} = \gamma_{10} + \gamma_{12}*cpeer + \zeta_{1i}

So this model posits that individuals have a liner trajectory over time (level 1), and that the parameters themselves of that linear trajectory differ between individuals based on coa and cpeer (level 2).

We can combine the two levels into one scary-looking composite representation of the model:
Y_{ij} = \gamma_{00} + \gamma_{01}*coa + \gamma_{02}*peer + \gamma_{10}*age14 + \gamma_{12}*peer*age14 + \zeta_{0i} + \zeta_{1i}*age14 + \epsilon_{ij}

Then we can estimate the parameters of that model in R with the following code:

alcohol1 <- read.table("", 
                       header=T, sep=",")
model.f1 <- lmer(alcuse ~ coa + cpeer*age_14 + (age_14 | id), alcohol1, REML = FALSE)

And now we are ready to test composite hypotheses about this model. The first example in the book asks whether the average child of non-alcoholic parents - with an average value of peer - drinks no alcohol at age 14 (intercept = 0) and remains abstinent over time (slope = 0). So we set coa and cpeer both equal to 0 in our composite model and we're left with this:

Y_{ij} = \gamma_{00} + \gamma_{10}*age14 + \zeta_{0i} + \zeta_{1i}*age14 + \epsilon_{ij}

Thus our question is essentially asking if the slope and intercept in this model are 0. Or to state it formally, our composite null hypothesis is as follows:

H_{0}: \gamma_{00} = 0 \: and \: \gamma_{10} = 0

Now to carry out this test, we need to express this hypothesis as a general linear hypothesis in matrix notation. First let's restate the hypothesis using our fixed effects:

1\gamma_{00} + 0\gamma_{01} + 0\gamma_{02} + 0\gamma_{10} + 0\gamma_{12} = 0
0\gamma_{00} + 0\gamma_{01} + 0\gamma_{02} + 1\gamma_{10} + 0\gamma_{12} = 0

We have weighted our coefficients so that the only two we're interested in are viable. Now we create a matrix of the weights. This is easier and faster to show in R than trying to use LaTeX, which I'm not even sure I can pull off with the Word Press plugin I'm using.

C <- matrix(c(1,0,0,0,0,0,0,0,1,0), nrow=2, byrow=TRUE)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    0    0    0    0
[2,]    0    0    0    1    0

Now using the matrix we just created we can conduct a linear hypothesis test using the linearHypothesis function available in the car package, like so:

linearHypothesis(model.f1, C)

This returns a Wald statistic of 51.03 on 2 degrees of freedom, almost matching the book which reports 51.01. The p-value is practically 0, which means we reject this composite hypothesis.

Now it's nice to know there's an R function that will calculate the Wald statistic, but what is it? How can we find out? The following code reveals the source of linearHypothesis:


In it we see the following calculation:

SSH <- as.vector(t(L %*% b - rhs) %*% solve(L %*% V %*% t(L)) %*% 
        (L %*% b - rhs))

That's it. So we have some matrix multiplication happening. In this calculation L is our hypothesis matrix, b is our fixed effects, rhs means "right hand side" (which in our example is 0) and V is the variance-covariance matrix of the model parameters.

If we wanted to calculate the Wald statistic by hand, we could do the following:

# extract the model coefficients
b <- matrix(summary(model.f1)@coefs[,1],nrow=5)

# create the "right-hand side"
q <- matrix(c(0,0),nrow=2)

# extract the variance-covariance matrix
V <- vcov(model.f1)

# calculate the Wald statistic
W <- t(C%*%b - q) %*% solve(C%*% V %*%t(C)) %*% (C%*%b - q)

To calculate the p-value, use the pchisq function:


Comparing Multilevel Models using Deviance Statistics (Ch 4 of ALDA)

The tour of Applied Longitudinal Data Analysis (ALDA) by Singer and Willett continues today with section 4.6, Comparing Models using Deviance Statistics. In the section prior to this they walk through building a model by way of examining hypothesis tests for fixed effects and variance components. While the former will be familiar to those who’ve done classical linear regression, the latter is likely something new. And perhaps not advised. As I mentioned in my previous post (last paragraph), they state in section 3.6.2 that “statisticians disagree as to the nature, form, and effectiveness of these tests.” They also “suggest you examine them only with extreme caution.” Therefore I decided not to blog about that particular tactic and instead focus on “a superior method for testing hypotheses about variance components.” (Also their words.) Of course I refer to the title of this post: Deviance Statistics.

As I’ve done in my other ALDA posts, I’m going to forgo exposition and get down to business. This post is meant to serve as a reference guide for my future self and maybe yours as well.

  • The Deviance Statistic is used to test the hypothesis that additional model predictors do not improve the fit of the model. The null hypothesis is that the coefficients of the additional predictors are 0.
  • To use the Deviance Statistic, one model must be nested in the other. That is, the smaller model can be derived from the bigger model by setting certain coefficients in the bigger model equal to 0.
  • Deviance = -2 * (Log Likelihood (LL) of model)
  • Deviance Statistic = -2 * (LL of model nested in bigger model – LL of bigger model)
  • Smaller Deviance is better. If adding more predictors to a model reduces deviance, that may be a good thing. The hypothesis test using the Deviance Statistic helps us determine whether or not the reduction in deviance is significant. A large p-value tells us no, it is not significant and that our model is not improved by the additional predictors. A small p-value tells us to reject the null and keep the extra predictors.
  • The distribution of the deviance statistic is chi-square with DF equal to the number of extra parameters in the bigger model.
  • Deviance obtained under Restricted Maximum Likelihood (REML) should only be used if the two models compared have the same fixed effects and differ only in their random effects. If this is not the case, the deviance obtained using Full ML should be used instead.


The example in Chapter 4 of ALDA involves alcohol use by adolescents. 82 were surveyed over time (3 waves). Some of the data collected include:

alcuse, a continuous measure of alcohol use based on a rating scale (the response)
age_14, age of participant centered about 14 (14 = 0, 15 = 1, 16 = 2)
coa, an indicator whether or not participant is a child of an alcoholic (1 = yes, 0 = no)
id, an arbitrary level to group persons

The model building process is reproduced in R on the UCLA stats consulting site. They use the nlme package. I will use the lme4 package below to demonstrate the use of the deviance statistic.

# read in and attach the data
alcohol1 <- read.table("", header=T, sep=",")

We're going to fit a model that only has age_14 as a predictor. Then we're going to build a model that has age_14 and coa as predictors. Notice the first model is "nested" in the second. In other words we can get the first model from the second model by setting the coa coefficients to 0.

alcuse = \gamma_{00} + \gamma_{10}*age14 + \zeta + \zeta*age14 + \epsilon

alcuse = \gamma_{00} + \gamma_{10}*age14 + \gamma_{01}*coa + \gamma_{11}*age14*coa + \zeta + \zeta_{1i}*age14 + \epsilon

Is the second model better than the first? The null hypothesis is no, it is not better.

H_{0}: \gamma_{01} = \gamma_{11} = 0

The second model has two additional fixed effects and no change in the random effects. Therefore to carry out this test, both models need to be fitted using Full Maximum Likelihood. (Note the argument "REML = FALSE" in the calls to lmer() below.)

model.b1 <- lmer(alcuse ~ age_14 + (age_14 | id), alcohol1, REML = FALSE)

model.c1 <- lmer(alcuse ~ age_14*coa + (age_14 | id), alcohol1, REML = FALSE)

Now we're ready to carry out the test. We can access the deviance of each model from the summary object, like so:

[1] 636.6111
[1] 621.2026

Notice the deviance of the bigger model is smaller than the deviance of the nested model. Is the reduction in deviance significant? To carry out the test we take the deviance of the smaller nested model and subtract from it the deviance of the bigger model. The difference is then compared to a chi-square distribution for significance. In this case, we'll compare the difference to a chi-square distribution with 2 degrees of freedom since the bigger model has two extra coefficients.

dev <- summary(model.b1)@AICtab$deviance - summary(model.c1)@AICtab$deviance
[1] 15.40846
1 - pchisq(dev,2)
[1] 0.0004509163

Now that's a small p-value. That's the probability we would see a difference this large (or larger) in deviance if the two coefficients really were 0. We reject the null hypothesis and conclude our model is improved by including the two coefficients associated with the coa predictor. If we were planning to do several such tests, we could write a function to make the process go a little faster.

# function to calculate deviance statistic and return p-value
# a = nested model object, b = bigger model object, df = degrees of freedom
dev <- function(a,b,df){
return(1 - pchisq(
	         summary(a)@AICtab$deviance - 

[1] 0.0004509163