Exploring Unordered Contrasts in R

Contrasts in R determine how linear model coefficients of categorical variables are interpreted. The default contrast for unordered categorical variables is the Treatment contrast. This means the “first” level (aka, the baseline) is rolled into the intercept and all subsequent levels have a coefficient that represents their difference from the baseline. That’s not too hard to grasp. But what about other contrasts, namely the Helmert and Sum contrasts? What do they do? Instead of explaining them, I figured I would demonstrate each.

First, let’s make some data:

# create 3 levels, 10 each
flevels <- factor(rep(c("A","B","C"),c(10,10,10))) 
# create some "nice" data, sorted so means at each level have good separation
vals <- sort(round(runif(30,3,15))) 
# calculate mean of each level for reference
means <- tapply(vals,flevels,mean) 
   A    B    C 
 6.0 10.1 12.9

Our data consist of three levels of arbitrary values. "flevels" is our categorical variable. Notice I explicitly defined it to be a factor using the factor() function. I need to do this so R knows this variable is a factor and codes it according to whatever contrast setting we decide to use.

Let's verify the default unordered contrast setting is the Treatment contrast:

        unordered           ordered 
"contr.treatment"      "contr.poly"

Indeed it is. This means our factor levels are coded as follows:

  B C
A 0 0
B 1 0
C 0 1

This is a 3 x 2 matrix. The 2 columns of the matrix tells us that our model will have 2 coefficients, one for the B level and one for the C level. Therefore the A level is the baseline. The coefficients we get in our linear model for B and C will indicate the respective differences of their mean from the level A mean. The values in the rows tell us what values to plug into the model to get the means for the row labels. For example, to get the mean for A we plug in 0's for both coefficients which leaves us with the intercept. Therefore the intercept is the mean of A. Let's see all this in action before we explore the Helmert and Sum contrasts.

m.trt <- lm(vals ~ flevels)

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   6.0000     0.3935  15.249 8.63e-15 ***
flevelsB      4.1000     0.5564   7.368 6.32e-08 ***
flevelsC      6.9000     0.5564  12.400 1.17e-12 ***

Now we can verify how the Treatment contrast works by extracting the coefficient values from the model and comparing to the means we calculated earlier:

# Intercept = mean of A
# flevelsB = mean of B - mean of A
means[2] - means[1]
# flevelsC = mean of C - mean of A
means[3] - means[1]

Let's also verify that plugging in the row values of the contrast matrix returns the means of each level:

# plug in row values into model to get the means of A, B and C, respectively
   A    B    C 
 6.0 10.1 12.9 
# mean of A --> row 1: 0 0
coef(m.trt)[1] + 0*coef(m.trt)[2] + 0*coef(m.trt)[3]
# mean of B --> row 2: 0 0
coef(m.trt)[1] + 1*coef(m.trt)[2] + 0*coef(m.trt)[3]
# mean of C --> row 3: 0 0
coef(m.trt)[1] + 0*coef(m.trt)[2] + 1*coef(m.trt)[3]

So that's how Treatment contrasts work. Now let's look at Helmert contrasts. "The coefficients for the Helmert regressors compare each level with the average of the "preceding" ones", says Fox in his book An R and S-Plus Companion to Applied Regression. I guess that makes sense. Kind of. Eh, not really. At least not to me. I say we do as we did before: fit a model and compare the coefficients to the means and see what they mean. Before we do that we need to set the contrast to Helmert:

# set contrast to "contr.helmert"
contrasts(flevels) <- "contr.helmert"
contrasts(flevels) # take a look
  [,1] [,2]
A   -1   -1
B    1   -1
C    0    2

Interesting. Notice the column labels are no longer associated with the levels of the factor. They just say 1 and 2. However this still tells us that our model will have two coefficients. Again the row values tell us what to plug in to get the means of A, B and C, respectively. To get the mean of A, we plug in -1 and -1 to the model. This means our intercept has a different interpretation. Let's fit the linear model and investigate.

m.hel <- lm(vals ~ flevels)

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.6667     0.2272  42.553  < 2e-16 ***
flevels1      2.0500     0.2782   7.368 6.32e-08 ***
flevels2      1.6167     0.1606  10.064 1.24e-10 ***

It turns out the intercept is the mean of the means, the first coefficient is the mean of the first two levels minus the first level, and the second coefficient is the mean of all three levels minus the mean of the first two levels. Did you get that? Here, this may help:

# intercept = mean of all means
[1] 9.666667
# flevels1 = mean of first two levels minus first level
mean(means[1:2]) - means[1]
# flevels2 = mean of all three levels minus mean of first two levels
mean(means) - mean(means[1:2])
[1] 1.616667

Let's do that thing again where we plug in the row values of the contrast matrix to verify it returns the means of the levels:

   A    B    C 
 6.0 10.1 12.9 
# mean of A --> row 1: -1 -1
coef(m.hel)[1] + -1*coef(m.hel)[2] + -1*coef(m.hel)[3]
# mean of B --> row 2: 1 -1
coef(m.hel)[1] + 1*coef(m.hel)[2] + -1*coef(m.hel)[3]
# mean of C --> row 3: 0 2
coef(m.hel)[1] + 0*coef(m.hel)[2] + 2*coef(m.hel)[3]

That leaves us with the Sum contrast. Regarding models fitted with the Sum contrasts, Fox tells us that "each coefficient compares the corresponding level of the factor to the average of the other levels." I think like Helmert contrasts, this one is better demonstrated. As before we need to change the contrast setting.

# set contrast to "contr.sum"
contrasts(flevels) <- "contr.sum"
contrasts(flevels) # take a look
  [,1] [,2]
A    1    0
B    0    1
C   -1   -1

Just like the Helmert contrast we see two columns with no labels. Our model will have two coefficients that don't correspond directly to the levels of our factors. By now we know the values in the rows are what we plug into our model to get the means of our levels. To get the mean of level A, we plug in 1 and 0. Time to fit the model and investigate:

m.sum <- lm(vals ~ flevels)
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.6667     0.2272  42.553  < 2e-16 ***
flevels1     -3.6667     0.3213 -11.413 7.75e-12 ***
flevels2      0.4333     0.3213   1.349    0.189  

Like the Helmert contrasts, our intercept is mean of all means. But our two coefficients have different interpretations. The first is the difference between the mean of level 1 (A) and the mean of all means. The second coefficient is the difference between the mean of level 2 (B) and the mean of all means. Notice in the model output above that the second coefficient is not significant. In other words, the mean of level B is not significantly different from the mean of all means.

# intercept = mean of all means
[1] 9.666667
# flevels1 = mean of level 1 (here, A) - mean of all means 
means[1] - mean(means)
# flevels2 = mean of level 2 (here, B) - mean of all means 
means[2] - mean(means)

Finally to be complete we plug in the row values of the Sum contrast matrix to verify it returns the means of the factor levels:

   A    B    C 
 6.0 10.1 12.9 
# mean of A -->row 1: 1 0
coef(m.sum)[1] + 1*coef(m.sum)[2] + 0*coef(m.sum)[3]
# mean of B -->row 2: 0 1
coef(m.sum)[1] + 0*coef(m.sum)[2] + 1*coef(m.sum)[3]
# mean of C -->row 3: -1 -1
coef(m.sum)[1] + -1*coef(m.sum)[2] + -1*coef(m.sum)[3]

And finally we wrap up this exercise by returning the contrast level of our categorical variable back to the system default:

contrasts(flevels) <- NULL

Hopefully this helps you get a better handle on what Helmert and Sum contrasts do.

Modeling Discontinuous Change (Ch 6 of ALDA)

Chapter 6 of ALDA introduces strategies for fitting models in which individual change is discontinuous. This means the linear trajectory has a shift in the elevation and/or slope. To fit a model with discontinuous change we need to “include one (or more) time-varying predictor(s) that specify whether and, if so, when each person experiences the hypothesized shift.” (p. 191)

Some of the forms discontinuous change can take include:

  • an immediate shift in elevation, but no shift in slope
  • an immediate shift in slope, but no shift in elevation
  • immediate shifts in both elevation and slope
  • shifts in elevation (or slope) that differ in magnitude by time
  • multiple shifts in elevation (or slope) during multiple epochs of time

The example in the book replicates work done by Murnane, et al. (1999). In their paper they analyzed wage data for high school dropouts and investigated whether (log) wage trajectories remained smooth functions of work experience. Their idea was that obtaining a GED might command a higher wage and thus cause a discontinuity in the linear model fit to the data. The authors partially replicate this work by fitting a taxonomy of multilevel models.

Here’s the data (courtesy of UCLA IDRE). The variables of interest are:

  • id – person ID
  • lnw – natural log of wages (the response)
  • ged – indicator (1 = attained GED; 0 otherwise)
  • exper – years in labor force to nearest day
  • postexp – years in labor force from day of GED attainment
  • hgc.9 – highest grade completed, centered on grade 9
  • black – indicator (1 = black; 0 otherwise)
  • ue.7 – unemployment rate, centered on 7%

The initial model is rather elaborate due to earlier work in the book. Using the book’s notation we state the level-1 and level-2 models as follows:

Y_{ij} = \pi_{0i} + \pi_{1i}EXPER_{ij} + \pi_{2i}(UE_{ij} - 7) + \epsilon_{ij}
\pi_{0i} = \gamma_{00} + \gamma_{01}(HGC_{i}-9) + \xi_{0i}
\pi_{1i} = \gamma_{10} + \gamma_{12}BLACK_{i}  + \xi_{1i}
\pi_{2i} = \gamma_{20}
\epsilon_{ij} \sim N(0,\sigma_{\epsilon}^{2}
\begin{bmatrix} \xi_{0i}\\ \xi_{1i} \end{bmatrix} \sim N \begin{pmatrix} \begin{bmatrix} 0\\ 0 \end{bmatrix}, \begin{bmatrix} \sigma_{0}^{2} \sigma_{01}\\ \sigma_{10} \sigma_{1}^{2} \end{bmatrix} \end{pmatrix}

What a mess. Let’s break this down. The level-1 model is the individual growth model. It posits an individual’s wages can be explained by his years in the labor force (EXPER) and the unemployment rate (UE). The level-2 model says

  • the intercept in the level-1 model varies by highest grade completed (HGC)
  • the EXPER coefficient in the level-1 model varies based on whether or not your race is black (BLACK)
  • the UE coefficient is fixed for all individuals

If we collapse the two levels into one model we get

Y_{ij} = \gamma_{00} + \gamma_{01}(HGC_{i}-9) + \gamma_{10}EXPER_{ij} + \gamma_{12}EXPER_{ij} \times BLACK_{i} + \gamma_{20}(UE_{ij}-7) + \xi_{0i} + \xi_{1i}EXPER_{ij} + \epsilon_{ij}

That’s not exactly fun to look at either, but the last few terms reveal the random effects. The \xi_{0i} is the random effect for the intercept, \xi_{1i} is the random effect for the EXPER slope parameter and \epsilon_{ij} is the residual error.

We can fit this model in R as follows:

wages <- read.table("http://www.ats.ucla.edu/stat/r/examples/alda/data/wages_pp.txt", 
          header=T, sep=",")
model.a <- lmer(lnw ~ exper + hgc.9 + exper:black + ue.7 + (exper|id), 
          wages, REML=FALSE)

The key part is the stuff in the parentheses. It says EXPER - and the intercept by default - are the random effects, and that they're grouped by ID (ie, the individuals). This means that each individual has his own intercept and EXPER coefficient in the fitted model. Let's look at the model's fixed effects and the random effects for individual 1.

The model's fixed effects:

(Intercept)       exper       hgc.9        ue.7 exper:black 
 1.74898840  0.04405394  0.04001100 -0.01195050 -0.01818322 

Random effects for first individual (ID = 31) in data:

   (Intercept)      exper
31  -0.1833142 0.02014632

To see the final model for this individual we add his random effects to the fixed effects:

ranef(model.a)$id[1,] + fixef(model.a)
   (Intercept)      exper
31    1.565674 0.06420026

Or we can just do this:

   (Intercept)      exper    hgc.9       ue.7 exper:black
31    1.565674 0.06420026 0.040011 -0.0119505 -0.01818322

Notice how the intercept and EXPER coefficient are different for the individual versus the fixed effects. Now we're usually less interested in the specific random effects (in this case there are 888 of them!) and more interested in their variances (or variance components). The variance component for the intercept is 0.051. The variance component for EXPER is 0.001. Those are pretty small but not negligible.

Having said all that, the goal of this exercise is to build the best model with discontinuities, which is largely done by deviance statistics. So let's work through the book's example and remember that everything I explained above is the baseline model. All subsequent models will build upon it.

The first model up adds a discontinuity in the intercept by including fixed and random effect for GED:

model.b <- lmer(lnw ~ exper + hgc.9 + exper:black + ue.7 + ged + 
                (exper + ged|id), wages, REML=FALSE)

To get a feel how GED affects the model, look at the records for individual 53:

wages[wages$id == 53,1:4]
   id   lnw exper ged
19 53 1.763 0.781   0
20 53 1.538 0.943   0
21 53 3.242 0.957   1
22 53 1.596 1.037   1
23 53 1.566 1.057   1
24 53 1.882 1.110   1
25 53 1.890 1.185   1
26 53 1.660 1.777   1

Notice how GED flips from 0 to 1 over time. Model B allows an individual's wage trajectory to shift in "elevation" at the point GED changes to 1. Hence the discontinuity. Should we allow this? Calling anova(model.a,model.b) helps us decide. In the output you'll see the p-value is less than 0.001. The null here is no difference between the models, i.e., the new explanatory variable in model B (GED) has no effect. So we reject the null and determine that an individual's wage trajectory may indeed display a discontinuity in elevation upon receipt of a GED.

Our next model is model B without GED random effects:

model.c <- lmer(lnw ~ exper + hgc.9 + exper:black + ue.7 + 
                ged + (exper|id), wages, REML=FALSE)

This is our baseline model with an additional fixed effect for GED. Should we include random effects for GED? Again we test the null that there is no difference between model B and C by calling anova(model.c,model.b). Since model C is nested in model B and the p-value returned is about 0.005, we reject the null and decide to keep the GED random components.

The next two models explore a discontinuity in the slope of the wages trajectory but not the elevation. We do this by removing the GED fixed and random effects and replace them with POSTEXP fixed and random effects. Recall that POSTEXP is years in labor force from day of GED attainment. To see how it works, look at the records for individual 4384:

wages[wages$id == 4384,3:5]
     exper ged postexp
2206 0.096   0   0.000
2207 1.039   0   0.000
2208 1.726   1   0.000
2209 3.128   1   1.402
2210 4.282   1   2.556
2211 5.724   1   3.998
2212 6.024   1   4.298

The record where GED flips to 1 is the day he obtains his GED. The next record clocks the elapsed time in the workforce since obtaining his GED: 1.402 years. POSTEXP records this explicitly. But that's what EXPER records as well: 3.128 - 1.726 = 1.402. So these two variables record the passage of time in lockstep. It's just that EXPER records from the "beginning" and POSTEXP records from the day of GED attainment. Allowing this additional variable into the model allows the slope of the wages trajectory to suddenly change when a GED is obtained. OK, enough talking. Let's see if we need it.

model.d <- lmer(lnw ~ exper + hgc.9 + exper:black + ue.7 + postexp + 
                (exper + postexp|id), wages, REML=FALSE)

The p-value returned from the anova function is about 0.01. This says model D is an improvement over model A and that the trajectory slope of wages indeed changes upon receipt of GED.

The next model is model D without random effects for POSTEXP. So whereas previously we allowed the change in slope to vary across individuals (random), now we're saying the change in slope is uniform (fixed) for all individuals. In the R code this means removing POSTEXP from the random effects portion but keeping it as a fixed effect.

model.e <- lmer(lnw ~ exper + hgc.9 + exper:black + ue.7 + postexp + 
               (exper|id), wages, REML=FALSE)

The results from this anova test conclude no difference between the models (p-value = 0.34). This means we may not need to allow for POSTEXP random effects.

But before we go with that, let's fit a model that allows for discontinuity in both the slope and elevation of the individuals' wages trajectory. In other words, let's throw both GED and POSTEXP in the model as both fixed and random effects. And then let's compare the model with previous models to determine whether or not to keep each predictor.

model.f <- lmer(lnw ~ exper + hgc.9 + exper:black + ue.7 + 
                postexp + ged + (postexp + ged + exper|id), 
                wages, REML=FALSE)

# compare models f and b to evaluate POSTEXP effect (NULL: no POSTEXP effect)

# compare models f and d to evaluate GED effect  (NULL: no GED effect)

In both anova tests, the p-values are very small, thus we reject the null in each case and retain the predictors. Therefore it appears that in the presence of the GED predictor that we do actually want to retain random effects for POSTEXP. But we're not done yet. Let's fit two more models each without the POSTEXP and GED random effects, respectively, and compare them to model F:

# MODEL G - Model F without POSTEXP variance component
model.g <- lmer(lnw ~ exper + hgc.9 + exper:black + ue.7 + 
                postexp + ged + (ged + exper|id), wages, REML=FALSE)

# MODEL H - Model F without GED variance component
model.h <- lmer(lnw ~ exper + hgc.9 + exper:black + ue.7 + 
                postexp + ged + (postexp + exper|id), wages, REML=FALSE)

# compare models g and h to model f to see if we should 
# keep POSTEXP and GED variance components
# NULL: do not need variance components

In both anova tests we get p-values less than 0.01 and reject the null. We should indeed keep the random effects. So our final model allows for both discontinuities in elevation and slope in the individuals' trajectories. Here's a super rough way we can visualize this in R:

# visual aid of discontinuity in slope and elevation
# made up model coefficients
b0 <- 3
b1 <- 5 
b2 <- 2
b3 <- 4
# variables
ind <- c(rep(0,10),rep(1,11)) # indicator of event
time <- c(1:10,10,11:20) # time
time2 <- c(rep(0,11),1:10)  # additional time tracking after event occurs
# create response
y <- b0 + b1*ind + b2*time + b3*time2
# plot response versus time
plot(time, y, type="l")

This gives us the following line plot:

Notice the shift in elevation at 10 and then the change in slope after the shift in elevation.

Treating Time More Flexibly (Ch 5 of ALDA)

Through 4 chapters of Applied Longitudinal Data Analysis (ALDA), the data sets have had the following constraints:

  • Balanced – all subjects have the same number of measurements.
  • Time structured – all subjects measured at the same time.
  • Time-invariant predictors – predictors that do not change over time, such as gender or treatment group.

In chapter 5 these constraints are relaxed. We work with unbalanced datasets with variably-spaced measurements and time-varying predictors. As usual, the UCLA stats consulting site replicates the chapter’s examples in 18 different stats programs. I won’t redo their work, but I will give you my boiled-down-most-important-points that I took away from this chapter. I’ll also show a couple of examples using the lmer() function from the lme4 package.

Section 5.1 Variably Spaced Measurement Occasions
Analyzing data sets with variably spaced measurement occasions is no different than analyzing data sets with identical occasions across individuals (time structured).

Example with unstructured data set (variably spaced measurements)
Data: reading scores recorded at three different times (i.e., 3 waves of data)
Fit two unconditional growth models

reading <- read.table("http://www.ats.ucla.edu/stat/r/examples/
                      alda/data/reading_pp.txt", header=T, sep=",")
mat2 <- reading[ ,3:4]-6.5
dimnames(mat2)[[2]] <- c("agegrp.c", "age.c")  
reading <- cbind(reading, mat2)

# forcing structure on data
lmer.agegrp <- lmer(piat ~ agegrp.c + (agegrp.c | id), reading, REML = FALSE)
# using unstructured data
lmer.age <- lmer(piat ~ age.c + (age.c | id), reading, REML = FALSE)

The first model treats the data as structured. Instead of using child’s precise age, we are using their age classification group (6.5, 8.5, 10.5). The second model uses the child’s precise age. Notice the second model’s lower deviance: 1803 versus 1820. “Treating the unstructured data as though it is time-structured introduces error in the analysis – error that we can reduce by using the child’s age at testing as the temporal predictor.” (p. 145)

Lesson: never force an unstructured data set to be structured.

Section 5.2 Varying Numbers of Measurement Occasions
Section 5.1 concerned varying spacing of measurements. This section concerns varying number of measurements . AKA Unbalanced data. Multilevel modeling allows analysis of data sets with varying numbers of waves of data.

All subjects can contribute to a multilevel model regardless of how many waves of data they contribute. No special procedures are needed to fit a multilevel model to unbalanced data, provided it’s not too unbalanced (i.e., too many people with too few waves with respect to the complexity of your specified model).

Potential Problems with unbalanced data

  • The iterative estimation algorithms may not converge. This affects variance components, not fixed effects. “Estimation of variance components requires that enough people have sufficient data to allow quantification of within-person residual variation.” (p. 152)
  • Exceeding boundary constraints, such as negative variance components. Your output may have an estimate of 0 to indicate this. Simplifying your model by removing random effects is usually the fix.
  • Nonconvergence. This can result from poorly specified models and insufficient data. Can also result from the outcome variable’s scale (too small, make larger) or the temporal predictor’s variable scale (too brief, make longer)

5.3 Time-Varying Predictors

A time-varying predictor is a variable whose values may differ over time. Examples: hours worked per week, money earned per year, employment status. No special strategies are needed to include a time-varying predictor in a multilevel model.

Examples with time-varying predictor
Data: depression scores (cesd) for unemployed; status of employment (unemp; 0 or 1) changes over time

unemployment <- read.table("http://www.ats.ucla.edu/stat/r/examples/
                           alda/data/unemployment_pp.txt", header=T, sep=",")

# time-varying predictor is unemp
lmer.unb <- lmer(cesd ~ months + unemp + (months | id), 
                 unemployment, REML = FALSE)

# allow effect of time-varying predictor (unemp) to vary over time
lmer.unc <- lmer(cesd ~ months + unemp*months + (months | id), 
                 unemployment, REML = FALSE)

# constant slope for unemp=0, changing slope for unemp=1
lmer.und <- lmer(cesd ~ unemp + unemp:months + (unemp + unemp:months | id), 
                 unemployment, REML = FALSE)

5.3 Recentering the Effect of Time

Recentering time can produce interpretive advantages such as an intercept that represents initial status. Time can also be recentered in such a way to produce an intercept that represents final status. This is useful when final status is of special concern. Changes in recentering produce different intercept parameters but leave slope and deviance statistics unchanged. It can also lead to an intercept being significant when it previously was not (and vice versa).

A Probability Problem in Heredity – Part 3

In my previous two posts I showed worked solutions to problems 2.5 and 11.7 in Bulmer’s Principles of Statistics, both of which involve the characteristics of self-fertilizing hybrid sweet peas. It turns out that problem 11.8 also involves this same topic, so why not work it as well for completeness. The problem asks us to assume that we were unable to find an explicit solution for the maximum likelihood equation in problem 11.7 and to solve it by using the following iterative method:

\( \theta_{1} = \theta_{0} + \frac{S(\theta_{0})}{I(\theta_{0})} \)

where \( S(\theta_{0}) \) is the value of \( \frac{d \log L}{d\theta}\) evaluated at \( \theta_{0}\) and \( I(\theta_{0})\) is the value of \( -E(\frac{d^{2}\log L}{d\theta^{2}})\) evaluated at \( \theta_{0}\).

So we begin with \( \theta_{0}\) and the iterative method returns \( \theta_{1}\). Now we run the iterative method again starting with \( \theta_{1}\) and get \( \theta_{2}\):

\( \theta_{2} = \theta_{1} + \frac{S(\theta_{1})}{I(\theta_{1})} \)

We repeat this process until we converge upon a value. This is called the Newton-Raphson method. Naturally this is something we would like to have the computer do for us.

First, recall our formulas from problem 11.7:

\( \frac{d \log L}{d\theta} = \frac{1528}{2 + \theta} – \frac{223}{1 – \theta} + \frac{381}{\theta} \)
\( \frac{d^{2}\log L}{d \theta^{2}} = -\frac{1528}{(2 + \theta)^{2}} -\frac{223}{(1 – \theta)^{2}} -\frac{381}{\theta^{2}} \)

Let’s write functions for those in R:

# maximum likelihood score
mls <- function(x) {
	1528/(2 + x) - 223/(1 - x) + 381/x
# the information
inf <- function(x) {
	-1528/((2 + x)^2) - 223/((1 - x)^2) - 381/(x^2)

Now we can use those functions in another function that will run the iterative method starting at a trial value:

# newton-raphson using expected information matrix
nr <- function(th) {
 prev <- th
 repeat {
   new <- prev + mls(prev)/-inf(prev)
   if(abs(prev - new)/abs(new) <0.0001)
   prev <- new

This function first takes its argument and names it "prev". Then it starts a repeating loop. The first thing the loop does it calculate the new value using the iterative formula. It then checks to see if the difference between the new and previous value - divided by the new value - is less than 0.0001. If it is, the loop breaks and the "new" value is printed to the console. If not, the loop repeats. Notice that each iteration is hopefully converging on a value. As it converges, the difference between the "prev" and "new" value will get smaller and smaller. So small that dividing the difference by the "new" value (or "prev" value for that matter) will begin to approach 0.

To run this function, we simply call it from the console. Let's start with a value of \( \theta_{0} = \frac{1}{4}\), as the problem suggests:

[1] 0.7844304

There you go! We could make the function tell us a little more by outputting the iterative values and number of iterations. Here's a super quick and dirty way to do that:

# newton-raphson using expected information matrix
nr <- function(th) {
 k <- 1 # number of iterations
 v <- c() # iterative values
  prev <- th
  repeat {
    new <- prev + mls(prev)/-inf(prev)
    v[k] <- new
    if(abs(prev - new)/abs(new) <0.0001)
    prev <- new
    k <- k + 1
print(new) # the value we converged on
print(v) # the iterative values
print(k) # number of iterations

Now when we run the function we get this:

[1] 0.7844304
[1] 0.5304977 0.8557780 0.8062570 0.7863259 0.7844441 0.7844304
[1] 6

We see it took 6 iterations to converge. And with that I think I've had my fill of heredity problems for a while.

A Probability Problem in Heredity – Part 2

In my last post I solved a problem from chapter 2 of M.G. Bulmer’s Principles of Statistics. In this post I work through a problem in chapter 11 that is basically a continuation of the chapter 2 problem. If you take a look at the previous post, you’ll notice we were asked to find probability in terms of theta. I did it and that’s nice and all, but we can go further. We actually have data, so we can estimate theta. And that’s what the problem in chapter 11 asks us to do. If you’re wondering why it took 9 chapters to get from finding theta to estimating theta, that’s because the first problem involved basic probability and this one requires maximum likelihood. It’s a bit of a jump where statistical background is concerned.

The results of the last post were as follows:

purple-flowered red-flowered
long pollen \( \frac{1}{4}(\theta + 2)\) \( \frac{1}{4}(1 – \theta) \)
round pollen \( \frac{1}{4}(1 – \theta) \) \( \frac{1}{4}\theta \)

The table provides probabilities of four possible phenotypes when hybrid sweet peas are allowed to self-fertilize. For example, the probability of a self-fertilizing sweet pea producing a purple-flower with long pollen is \( \frac{1}{4}(1 – \theta)\). In this post we’ll estimate theta from our data. Recall that \( \theta = (1 – \pi)^{2} \), where \( \pi \) is the probability of the dominant and recessive genes of a characteristic switching chromosomes.

Here’s the data:

Purple-flowered Red-flowered
Long pollen 1528 117
Round pollen 106 381

We see from the table there are 4 exclusive possibilities when the sweet pea self-fertilizes. If we think of each possibility having its own probability of occurrence, then we can think of this data as a sample from a multinomial distribution. Since chapter 11 covers maximum likelihood estimation, the problem therefore asks us to use the multinomial likelihood function to estimate theta.

Now the maximum likelihood estimator for each probability is \( \hat{p_{i}} = \frac{x_{i}}{n} \). But we can’t use that. That’s estimating four parameters. We need to estimate one parameter, theta. So we need to go back to the multinomial maximum likelihood function and define \( p_{i}\) in terms of theta. And of course we’ll work with the log likelihood since it’s easier to work with sums than products.

\( \log L(\theta) = y_{1} \log p_{1} + y_{2} \log p_{2} + y_{3} \log p_{3} + y_{4} \log p_{4} \)

If you’re not sure how I got that, google “log likelihood multinomial distribution” for more PDF lecture notes than you can ever hope to read.

Now let’s define the probabilities in terms of one parameter, theta:

\( \log L(\theta) = y_{1} \log f_{1}(\theta) + y_{2} \log f_{2}(\theta) + y_{3} \log f_{3}(\theta) + y_{4} \log f_{4}(\theta) \)

Now take the derivative. Once we have that we can set equal to 0 and find a solution for theta. The solution will be the point at which theta obtains its maximum value:

\( \frac{d \log L(\theta)}{d\theta} = \frac{y_{1}}{f_{1}(\theta)} f’_{1}(\theta) + \frac{y_{1}}{f_{2}(\theta)} f’_{2}(\theta) + \frac{y_{1}}{f_{3}(\theta)} f’_{3}(\theta) + \frac{y_{1}}{f_{4}(\theta)} f’_{4}(\theta)\)

Time to go from the abstract to the applied with our values. The y’s are the data from our table and the functions of theta are the results from the previous problem.

\( \frac{d \log L(\theta)}{d\theta} = \frac{1528}{1/4(2 + \theta)} \frac{1}{4} – \frac{117}{1/4(1 – \theta)}\frac{1}{4} – \frac{106}{1/4(1 – \theta)}\frac{1}{4} + \frac{381}{1/4(\theta)} \frac{1}{4} \)
\( \frac{d \log L(\theta)}{d\theta} = \frac{1528}{2 + \theta} – \frac{117}{1 – \theta} – \frac{106}{1 – \theta} + \frac{381}{\theta} \)
\( \frac{d \log L(\theta)}{d\theta} = \frac{1528}{2 + \theta} – \frac{223}{1 – \theta} + \frac{381}{\theta} \)

Set equal to 0 and solve for theta. Beware, it gets messy.

\( \frac{[1528(1 – \theta)\theta] – [223(2 + \theta)\theta] + [381(2 + \theta)(1 – \theta)]}{(2 + \theta)(1- \theta)\theta} = 0\)

Yeesh. Fortunately the denominator cancels out when we start multiplying terms and solving for theta. So we’re left with this:

\( 1528\theta – 1528\theta^{2} – 446\theta – 223\theta^{2} + 762 – 381\theta – 381\theta^{2} = 0\)

And that reduces to the following quadratic equation:

\( -2132\theta^{2} + 701\theta + 762 = 0\)

I propose we use an online calculator to solve this equation. Here’s a good one. Just plug in the coefficients and hit solve to find the roots. Our coefficients are a = -2132, b = 701, and c = 762. Since it’s a quadratic equation we get two answers:

\( x_{1} = -0.4556 \)
\( x_{2} = 0.7844 \)

The answer is in terms of a probability which is between 0 and 1, so we toss the negative answer and behold our maximum likelihood estimator for theta: 0.7844.

Remember that \( \theta = (1 – \pi)^{2}\). If we solve for pi, we get \( \pi = 1 – \theta^{1/2}\), which works out to be 0.1143. That is, we estimate the probability of characteristic genes switching chromosomes to be about 11%. Therefore we can think of theta as the probability of having two parents that experienced no gene switching.

Now point estimates are just estimates. We would like to know how good the estimate is. That’s where confidence intervals come in to play. Let’s calculate one for theta.

It turns out that we can estimate the variability of our maximum likelihood estimator as follows:

\( V(\theta) = \frac{1}{I(\theta)}\), where \( I(\theta) = -E(\frac{d^{2}\log L}{d \theta^{2}}) \)

We need to determine the second derivative of our log likelihood equation, take the expected value by plugging in our maximum likelihood estimator, multiply that by -1, and then take the reciprocal. The second derivative works out to be:

\( \frac{d^{2}\log L}{d \theta^{2}} = -\frac{1528}{(2 + \theta)^{2}} -\frac{223}{(1 – \theta)^{2}} -\frac{381}{\theta^{2}} \)

The negative expected value of the second derivative is obtained by plugging in our estimate of 0.7844 and multiplying by -1. Let’s head to the R console to calculate this:

th <- 0.7844 # our ML estimate
(I <- -1 * (-1528/((2+th)^2) - 223/((1-th)^2) - 381/(th^2))) # information
[1] 5613.731

Now take the reciprocal and we have our variance:

(v.th <- 1/I)
[1] 0.0001781347

We can take the square root of the variance to get the standard error and multiply by 1.96 to get the margin of error for our estimate. Then add/subtract the margin of error to our estimate for a confidence interval:

# confidence limits on theta
0.784+(1.96*sqrt(v.th)) # upper bound
[1] 0.8101596
0.784-(1.96*sqrt(v.th)) # lower bound
[1] 0.7578404

Finally we convert the confidence interval for theta to a confidence interval for pi:

# probability of switching over
th.ub <- 0.784+(1.96*sqrt(v.th))
th.lb <- 0.784-(1.96*sqrt(v.th))
1-sqrt(th.ub) # lower bound
[1] 0.09991136
1-sqrt(th.lb) # upper bound
[1] 0.1294597

The probability of genes switching chromosomes is most probably in the range of 10% to 13%.

A Probability Problem in Heredity

Here’s a fun problem in heredity from the Dover classic Principles of Statistics by M.G. Bulmer. It’s from chapter 2, problem 2.5.

The results of Table 7 on p. 25 (see below) can be explained on the assumption that the genes for flower colour and pollen shape are on the same chromosome but that there is a probability π that one of the genes will be exchanged for the corresponding gene on the other chromosome. If we denote the genes for purple or red flowers by P and p, and the genes for long and round pollen by L and l, then the hybrids from the cross considered will all be of the genotype PL/pl, the notation indicating that the P and L genes are on one chromosome and the p and l genes on the other. When these hybrids are allowed to self-fertilise, there is a chance π that the L and l genes will interchange in one parent, giving Pl/pL; there are therefore really three mating types, PL/pl X PL/pl, Pl/pL X PL/pl and Pl/pL x Pl/pL, which occur with probabilities \( (1 – \pi)^{2} \), \( 2\pi(1 – \pi) \) and \(\pi^{2}\) respectively. Find the probabilities of the four possible phenotypes resulting from the experiment in terms of \( \theta = (1 – \pi)^{2} \).

Here’s the table the problem refers to:

Purple-flowered Red-flowered Total
Long pollen 1528 117 1645
Round pollen 106 381 487
Total 1634 498 2132

The interesting thing about that table is that it violates Mendel’s law of independent assortment. If it obeyed that law, then the probabilities of the resulting phenotypes would be:

Purple-flowered Red-flowered
Long pollen \( \frac{9}{16} \) \( \frac{3}{16} \)
Round pollen \( \frac{3}{16} \) \( \frac{1}{16} \)

We’d expect the purple/long pollen flower to happen about 9/16 = 56% of the time. Instead we see it occurring about 1528/2132 = 72% of the time. As the problem explains this has to do with the way genes are carried on chromosomes. This means we can’t calculate probabilities as you normally would for a dihybrid cross. Therefore it asks us to calculate those probabilities conditional on whether or not the gene for pollen switched chromosomes. Our answer will take the form above, but instead of actual numbers, we’ll express our answer in terms of \( \theta = (1 – \pi)^{2} \). In other words theta is the probability that the gene for pollen did not switch chromosomes.

The hard part of this problem is setting it up. First, we have to recognize that we don’t know which chromosome has the characteristics. The characteristics of mating type PL/pl X PL/pl could both be on the PL chromosomes, in which case the result would be a PL, or a purple/long pollen flower. Or one could be PL and the other pl, which would still be PL, a purple/long pollen flower, since purple and long pollen are dominant characteristics. If both are on the pl chromosome, then the result would be pl, a red/round pollen flower. All told, for the PL/pl X PL/pl mating type we have the following possibilities:

PL/pl X PL/pl mating table (p = \( (1 – \pi)^{2}\) )
PL pl
pl PL pl

For this mating type we get a purple/long pollen flower (PL) three out of four times and a red/round pollen flower (pl) one out of four times. We need to construct similar tables for the other two mating types. But notice the Pl/pL X PL/pl mating type can actually happen in reverse order as PL/pl X Pl/pL, so it needs two tables. Therefore we really need three more tables:

Pl/pL X PL/pl mating table (p = \( \pi(1 – \pi)\) )
PL pl
Pl PL Pl
pL PL pL
PL/pl X Pl/pL mating table (p = \( \pi(1 – \pi)\) )
Pl pL
pl Pl pL
Pl/pL X Pl/pL mating table (p = \( \pi^{2}\) )
Pl pL
Pl Pl PL
pL PL pL

Have to give a quick shout out to http://truben.no/latex/table/ for helping me make those tables! Anyway…so we have 4 tables displaying 16 possible results:

  • 9 out of 16 times you get PL, a purple/long pollen flower
  • 3 out of 16 times you get Pl, a purple/round pollen flower
  • 3 out of 16 times you get pL, a red/long pollen flower
  • 1 out of 16 times you get pl, a red/round pollen flower

If the results were independent, we could just call those probabilities. But they’re not. That’s the whole point of the problem. We have to take into account the probability π of the exchange of genes from one chromosome to the other. Let’s reset the probabilities for the four mating types:

  1. PL/pl X PL/pl = \( (1-\pi)^{2} \)
  2. Pl/pL X PL/pl = \( \pi(1-\pi) \)
  3. PL/pl X pl/pL = \( \pi(1-\pi) \)
  4. Pl/pL X Pl/pL = \( \pi^{2} \)

So the probability of getting pl in the first mating type is \( \frac{1}{4}(1-\pi)^{2} \). Recall the problem asks us to find these probabilities in terms of \( \theta = (1 – \pi)^{2} \), so we can express this as \( \frac{1}{4}\theta \). And there’s one of our answers since pl does not occur in any of the other mating types.

Now we just need to find the other probabilities. To make life easier, let’s go ahead and convert all the probabilities in terms of \( \theta \):

  • \( (1 – \pi)^2 = \theta \)
  • \( (1 – \pi) = \theta^{1/2} \)
  • \( \pi = 1 – \theta^{1/2} \)
  • \( \pi^{2} = (1- \theta^{1/2})^{2} \)

The hardest one to find is PL:

\( PL = \frac{3}{4}\theta + \frac{2}{4}(1 – \theta^{1/2})\theta^{1/2} + \frac{2}{4}(1 – \theta^{1/2})\theta^{1/2} + \frac{2}{4}(1 – \theta^{1/2})^{2} \)
\( PL = \frac{3}{4}\theta + \frac{2}{4}[2(1 – \theta^{1/2})\theta^{1/2} + 1 – \theta^{1/2} – \theta^{1/2} + \theta] \)
\( PL = \frac{3}{4}\theta + \frac{2}{4}[2(\theta^{1/2} – \theta) + 1 – 2\theta^{1/2} + \theta] \)
\( PL = \frac{3}{4}\theta + \frac{2}{4}[2\theta^{1/2} – 2\theta + 1 – 2\theta^{1/2} + \theta] \)
\( PL = \frac{3}{4}\theta + \frac{2}{4}(1 – \theta) \)
\( PL = \frac{3}{4}\theta + \frac{2}{4} – \frac{2}{4}\theta \)
\( PL = \frac{1}{4}\theta + \frac{2}{4} \)
\( PL = \frac{1}{4}(\theta + 2)\)

Next up is pL:

\( pL = \frac{1}{4}(1 – \theta^{1/2})\theta^{1/2} + \frac{1}{4}(1 – \theta^{1/2})\theta^{1/2} + \frac{1}{4}(1 – \theta^{1/2})^{2} \)
\( pL = \frac{1}{4}(\theta^{1/2} – \theta + \theta^{1/2} – \theta + 1 – \theta^{1/2} – \theta^{1/2} + \theta) \)
\( pL = \frac{1}{4}(-2\theta + 1 + \theta) \)
\( pL = \frac{1}{4}(1 – \theta) \)

That just leaves Pl. But if you look at the tables above, you’ll notice Pl appears in the same tables with pL the same number of times. So if we set up an equation to find its probability, we’ll get the same equation we started with when we solved Pl. That means we’ll get the same answer. So we don’t need to solve it. We already know it: \( pL = \frac{1}{4}(1 – \theta) \)

And that finishes the problem. The probabilities of the four possible phenotypes are as follows:

purple-flowered red-flowered
long pollen \( \frac{1}{4}(\theta + 2) \) \( \frac{1}{4}(1 – \theta) \)
round pollen \( \frac{1}{4}(1 – \theta) \) \( \frac{1}{4}\theta \)

Testing Composite Hypotheses about Fixed Effects (Ch 4 of ALDA)

I’m still on my ALDA kick here, this time posting about section 4.7 of Chapter 4. In my last post I talked deviance-based hypothesis tests in the context of model building. Recall you have to be aware of which method of estimation you used when you deploy the deviance statistic. Namely, if you used Restricted Maximum Likelihood (RML) to estimate model parameters, you can only use the deviance statistic to test hypotheses about variance components. This is an important point as many programs default to RML, such as the lme4 package in R and SAS PROC MIXED. But the deviance statistic is not the only tool for testing composite hypotheses.

That leads us to section 4.7 and the Wald statistic. The Wald statistic allows you to “test composite hypotheses about multiple effects regardless of the method of estimation used. This means that if you use restricted methods of estimation, which prevent you from using deviance-based tests to compare models with different fixed effects, you still have a means of testing composite hypotheses about sets of fixed effects.” (p. 122) Sounds good to me!

The authors give two examples, one of which I want to review in this post. As usual they don’t show you how to do the test using statistical software. Unfortunately either does the UCLA stats consulting page for ALDA. So I had to figure it out on my own.

Let’s reset the example study motivating the work in this chapter. 82 adolescents were surveyed on alcohol use. Some of the variables collected included:

  • alcuse, a rating-scale measure of alcohol use (the response)
  • age_14, age of participant centered about 14 (14 = 0, 15 = 1, 16 = 2)
  • coa, an indicator whether or not participant is a child of an alcoholic (1 = yes, 0 = no)
  • cpeer, a rating-scale measure of alcohol use among peers centered on its sample mean of 1.018
  • id, an arbitrary level to group persons

These variables are part of Model F, the model of interest in section 4.7, which aims to explain the variability in alcuse. (Models A through E precede Model F in the earlier model-building portion of the chapter.) Here’s Model F in its multilevel form:

level 1
Y_{ij} = \pi_{0i} + \pi_{1i}*age14_{ij} + \epsilon_{ij}

level 2
\pi_{0i} = \gamma_{00} + \gamma_{01}*coa + \gamma_{02}*cpeer + \zeta_{0i}
\pi_{1i} = \gamma_{10} + \gamma_{12}*cpeer + \zeta_{1i}

So this model posits that individuals have a liner trajectory over time (level 1), and that the parameters themselves of that linear trajectory differ between individuals based on coa and cpeer (level 2).

We can combine the two levels into one scary-looking composite representation of the model:
Y_{ij} = \gamma_{00} + \gamma_{01}*coa + \gamma_{02}*peer + \gamma_{10}*age14 + \gamma_{12}*peer*age14 + \zeta_{0i} + \zeta_{1i}*age14 + \epsilon_{ij}

Then we can estimate the parameters of that model in R with the following code:

alcohol1 <- read.table("http://www.ats.ucla.edu/stat/r/examples/alda/data/alcohol1_pp.txt", 
                       header=T, sep=",")
model.f1 <- lmer(alcuse ~ coa + cpeer*age_14 + (age_14 | id), alcohol1, REML = FALSE)

And now we are ready to test composite hypotheses about this model. The first example in the book asks whether the average child of non-alcoholic parents - with an average value of peer - drinks no alcohol at age 14 (intercept = 0) and remains abstinent over time (slope = 0). So we set coa and cpeer both equal to 0 in our composite model and we're left with this:

Y_{ij} = \gamma_{00} + \gamma_{10}*age14 + \zeta_{0i} + \zeta_{1i}*age14 + \epsilon_{ij}

Thus our question is essentially asking if the slope and intercept in this model are 0. Or to state it formally, our composite null hypothesis is as follows:

H_{0}: \gamma_{00} = 0 \: and \: \gamma_{10} = 0

Now to carry out this test, we need to express this hypothesis as a general linear hypothesis in matrix notation. First let's restate the hypothesis using our fixed effects:

1\gamma_{00} + 0\gamma_{01} + 0\gamma_{02} + 0\gamma_{10} + 0\gamma_{12} = 0
0\gamma_{00} + 0\gamma_{01} + 0\gamma_{02} + 1\gamma_{10} + 0\gamma_{12} = 0

We have weighted our coefficients so that the only two we're interested in are viable. Now we create a matrix of the weights. This is easier and faster to show in R than trying to use LaTeX, which I'm not even sure I can pull off with the Word Press plugin I'm using.

C <- matrix(c(1,0,0,0,0,0,0,0,1,0), nrow=2, byrow=TRUE)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    0    0    0    0
[2,]    0    0    0    1    0

Now using the matrix we just created we can conduct a linear hypothesis test using the linearHypothesis function available in the car package, like so:

linearHypothesis(model.f1, C)

This returns a Wald statistic of 51.03 on 2 degrees of freedom, almost matching the book which reports 51.01. The p-value is practically 0, which means we reject this composite hypothesis.

Now it's nice to know there's an R function that will calculate the Wald statistic, but what is it? How can we find out? The following code reveals the source of linearHypothesis:


In it we see the following calculation:

SSH <- as.vector(t(L %*% b - rhs) %*% solve(L %*% V %*% t(L)) %*% 
        (L %*% b - rhs))

That's it. So we have some matrix multiplication happening. In this calculation L is our hypothesis matrix, b is our fixed effects, rhs means "right hand side" (which in our example is 0) and V is the variance-covariance matrix of the model parameters.

If we wanted to calculate the Wald statistic by hand, we could do the following:

# extract the model coefficients
b <- matrix(summary(model.f1)@coefs[,1],nrow=5)

# create the "right-hand side"
q <- matrix(c(0,0),nrow=2)

# extract the variance-covariance matrix
V <- vcov(model.f1)

# calculate the Wald statistic
W <- t(C%*%b - q) %*% solve(C%*% V %*%t(C)) %*% (C%*%b - q)

To calculate the p-value, use the pchisq function:


Comparing Multilevel Models using Deviance Statistics (Ch 4 of ALDA)

The tour of Applied Longitudinal Data Analysis (ALDA) by Singer and Willett continues today with section 4.6, Comparing Models using Deviance Statistics. In the section prior to this they walk through building a model by way of examining hypothesis tests for fixed effects and variance components. While the former will be familiar to those who’ve done classical linear regression, the latter is likely something new. And perhaps not advised. As I mentioned in my previous post (last paragraph), they state in section 3.6.2 that “statisticians disagree as to the nature, form, and effectiveness of these tests.” They also “suggest you examine them only with extreme caution.” Therefore I decided not to blog about that particular tactic and instead focus on “a superior method for testing hypotheses about variance components.” (Also their words.) Of course I refer to the title of this post: Deviance Statistics.

As I’ve done in my other ALDA posts, I’m going to forgo exposition and get down to business. This post is meant to serve as a reference guide for my future self and maybe yours as well.

  • The Deviance Statistic is used to test the hypothesis that additional model predictors do not improve the fit of the model. The null hypothesis is that the coefficients of the additional predictors are 0.
  • To use the Deviance Statistic, one model must be nested in the other. That is, the smaller model can be derived from the bigger model by setting certain coefficients in the bigger model equal to 0.
  • Deviance = -2 * (Log Likelihood (LL) of model)
  • Deviance Statistic = -2 * (LL of model nested in bigger model – LL of bigger model)
  • Smaller Deviance is better. If adding more predictors to a model reduces deviance, that may be a good thing. The hypothesis test using the Deviance Statistic helps us determine whether or not the reduction in deviance is significant. A large p-value tells us no, it is not significant and that our model is not improved by the additional predictors. A small p-value tells us to reject the null and keep the extra predictors.
  • The distribution of the deviance statistic is chi-square with DF equal to the number of extra parameters in the bigger model.
  • Deviance obtained under Restricted Maximum Likelihood (REML) should only be used if the two models compared have the same fixed effects and differ only in their random effects. If this is not the case, the deviance obtained using Full ML should be used instead.


The example in Chapter 4 of ALDA involves alcohol use by adolescents. 82 were surveyed over time (3 waves). Some of the data collected include:

alcuse, a continuous measure of alcohol use based on a rating scale (the response)
age_14, age of participant centered about 14 (14 = 0, 15 = 1, 16 = 2)
coa, an indicator whether or not participant is a child of an alcoholic (1 = yes, 0 = no)
id, an arbitrary level to group persons

The model building process is reproduced in R on the UCLA stats consulting site. They use the nlme package. I will use the lme4 package below to demonstrate the use of the deviance statistic.

# read in and attach the data
alcohol1 <- read.table("http://www.ats.ucla.edu/stat/r/examples/alda/data/alcohol1_pp.txt", header=T, sep=",")

We're going to fit a model that only has age_14 as a predictor. Then we're going to build a model that has age_14 and coa as predictors. Notice the first model is "nested" in the second. In other words we can get the first model from the second model by setting the coa coefficients to 0.

alcuse = \gamma_{00} + \gamma_{10}*age14 + \zeta + \zeta*age14 + \epsilon

alcuse = \gamma_{00} + \gamma_{10}*age14 + \gamma_{01}*coa + \gamma_{11}*age14*coa + \zeta + \zeta_{1i}*age14 + \epsilon

Is the second model better than the first? The null hypothesis is no, it is not better.

H_{0}: \gamma_{01} = \gamma_{11} = 0

The second model has two additional fixed effects and no change in the random effects. Therefore to carry out this test, both models need to be fitted using Full Maximum Likelihood. (Note the argument "REML = FALSE" in the calls to lmer() below.)

model.b1 <- lmer(alcuse ~ age_14 + (age_14 | id), alcohol1, REML = FALSE)

model.c1 <- lmer(alcuse ~ age_14*coa + (age_14 | id), alcohol1, REML = FALSE)

Now we're ready to carry out the test. We can access the deviance of each model from the summary object, like so:

[1] 636.6111
[1] 621.2026

Notice the deviance of the bigger model is smaller than the deviance of the nested model. Is the reduction in deviance significant? To carry out the test we take the deviance of the smaller nested model and subtract from it the deviance of the bigger model. The difference is then compared to a chi-square distribution for significance. In this case, we'll compare the difference to a chi-square distribution with 2 degrees of freedom since the bigger model has two extra coefficients.

dev <- summary(model.b1)@AICtab$deviance - summary(model.c1)@AICtab$deviance
[1] 15.40846
1 - pchisq(dev,2)
[1] 0.0004509163

Now that's a small p-value. That's the probability we would see a difference this large (or larger) in deviance if the two coefficients really were 0. We reject the null hypothesis and conclude our model is improved by including the two coefficients associated with the coa predictor. If we were planning to do several such tests, we could write a function to make the process go a little faster.

# function to calculate deviance statistic and return p-value
# a = nested model object, b = bigger model object, df = degrees of freedom
dev <- function(a,b,df){
return(1 - pchisq(
	         summary(a)@AICtab$deviance - 

[1] 0.0004509163

Unconditional Multilevel Models for Change (Ch 4 of ALDA)

In Chapter 4 (section 4.4) of Applied Longitudinal Data Analysis (ALDA), Singer and Willett recommend fitting two simple unconditional models before you begin multilevel model building in earnest. These two models “allow you to establish: (1) whether there is systematic variation in your outcome that is worth exploring; and (2) where that variation resides (within or between people).” (p. 92) This is a great idea. Why start building models if there is no variation to explain? In this post I want to summarize these two models for reference purposes.

Model 1: The Unconditional Means Model

  • The keyword here is “means”. This model has one fixed effect that estimates the grand mean of the response across all occasions and individuals.
  • The main reason to fit this model is to examine the random effects (i.e., the within-person and between-person variance components). This tells us the amount of variation that exists at the within-person level and the between-person level.
  • Model specification: \( Y_{ij} = \gamma_{00} + \zeta_{0i} + \epsilon_{ij} \)
    • \( \gamma_{00} \) = grand mean (fixed effect)
    • \( \zeta_{0i} \) = the amount person i’s mean deviates from the population mean (between-person)
    • \( \epsilon_{ij} \) = the amount the response on occasion j deviates from person i’s mean (within-person)
    • \( \epsilon_{ij} \sim N(0,\sigma_{\epsilon}^{2}) \)
    • \( \zeta_{0i} \sim N(0, \sigma_{0}^{2}) \)
  • Use the intraclass correlation coefficient to describe the proportion of the total outcome variation that lies “between” people: \( \rho = \sigma_{0}^{2} / (\sigma_{0}^{2} + \sigma_{\epsilon}^{2}) \)
  • In the unconditional means model the intraclass correlation coefficient is also the “error autocorrelation coefficient”, which estimates the average correlation between any pair of composite residuals: \( \zeta_{0i} + \epsilon_{ij} \)
  • Sample R code for fitting the unconditional means model (where “id” = person-level grouping indicator):
    lme(response ~ 1, data=dataset, random= ~ 1 | id)

    Or this:

    lmer(response ~ 1 + (1 | id), dataset)

To replicate the Unconditional Means Model example in ALDA, the UCLA stats page suggests the following:

alcohol1 <- read.table("http://www.ats.ucla.edu/stat/r/examples/alda/data/alcohol1_pp.txt", 
                       header=T, sep=",")
model.a <- lme(alcuse~ 1, alcohol1, random= ~1 |id)

This works OK, but returns slightly different results because it fits the model using REML (Restricted Maximum Likelihood) instead of ML (Maximum Likelihood). It also does not return the estimated between-person variance \( \sigma_{0}^{2} \). We can "fix" the first issue by including the argument method="ML". There doesn't appear to be anything we can do about the second. However, the lmer() function allows us to replicate the example and obtain the same results presented in the book, as follows (notice we have to specify ML implicitly with the argument REML = FALSE):

model.a1 <- lmer(alcuse ~ 1 + (1 | id), alcohol1, REML = FALSE)

The output provides the values discussed in the book in the "Random effects" section under the variance column:

> summary(model.a1)
Linear mixed model fit by maximum likelihood 
Formula: alcuse ~ 1 + (1 | id) 
   Data: alcohol1 
   AIC   BIC logLik deviance REMLdev
 676.2 686.7 -335.1    670.2     673
Random effects:
 Groups   Name        Variance Std.Dev.
 id       (Intercept) 0.56386  0.75091 
 Residual             0.56175  0.74950 
Number of obs: 246, groups: id, 82

Fixed effects:
            Estimate Std. Error t value
(Intercept)   0.9220     0.0957   9.633

The "Random effect" id has variance = 0.564. That's the between-person variance. The "Random effect" Residual has variance = 0.562. That's the within-person variance. We can access these values using "summary(model.a1)@REmat" and calculate the intraclass correlation coefficient like so:

icc_n <- as.numeric(summary(model.a1)@REmat[1,3])
icc_d <- as.numeric(summary(model.a1)@REmat[1,3]) + 
icc_n / icc_d
[1] 0.5009373

Model 2: The Unconditional Growth Model

  • This model partitions and quantifies variance across people and time.
  • The fixed effects estimate the starting point and slope of the population average change trajectory.
  • Model specification: \( Y_{ij} = \gamma_{00} + \gamma_{10}*time_{ij} + \zeta_{0i} + \zeta_{1i}*time_{ij} + \epsilon_{ij} \)
    • \( \gamma_{00} \) = average intercept (fixed effect)
    • \( \gamma_{10} \) = average slope (fixed effect)
    • \( \zeta_{0i} \) = the amount person i's intercept deviates from the population intercept
    • \( \zeta_{1i} \) = the amount person i's slope deviates from the population slope
    • \( \epsilon_{ij} \) = the amount the response on occasion j deviates from person i's true change trajectory
    • \( \epsilon_{ij} \sim N(0,\sigma_{\epsilon}^{2}) \)
    • \( \zeta_{0i} \sim N(0, \sigma_{0}^{2}) \)
    • \( \zeta_{1i} \sim N(0, \sigma_{1}^{2}) \)
    • \( \zeta_{0i}\) and \( \zeta_{1i} \) have covariance \( \sigma_{1}^{2} \)
  • The residual variance \( \sigma_{\epsilon}^{2} \) summarizes the average scatter of an individual's observed outcome values around his/her own true change trajectory. Compare this to the same value in the unconditional means model to see if within-person variation is systematically associated with linear time.
  • The level-2 variance components, \( \sigma_{0}^{2} \) and \( \sigma_{1}^{2} \) quantify the unpredicted variability in the intercept and slope of individuals. That is, they assess the scatter of a person's intercept and slope about the population average change trajectory. DO NOT compare to the same values in the unconditional means model since they have a different interpretation.
  • The level-2 covariance \( \sigma_{01} \) quantifies the population covariance between the true initial status (intercept) and true change (slope). Interpretation is easier if we re-express the covariance as a correlation coefficient: \( \hat{\rho}_{01} = \hat{\sigma}_{01} / \sqrt{\hat{\sigma}_{0}^{2}\hat{\sigma}_{1}^{2}} \)
  • Sample R code for fitting the unconditional growth model (where "id" = person-level grouping indicator):
    lme(response ~ time , data=dataset, random= ~ time | id)

    Or this:

    lmer(alcuse ~ time + (time | id), dataset)

To replicate the Unconditional Growth Model example in ALDA, the UCLA stats page suggests the following:

alcohol1 <- read.table("http://www.ats.ucla.edu/stat/r/examples/alda/data/alcohol1_pp.txt", 
                       header=T, sep=",")
model.b <- lme(alcuse ~ age_14 , data=alcohol1, random= ~ age_14 | id, method="ML")

However I think the following is better as it gives the same values in the book:

model.b1 <- lmer(alcuse ~ age_14 + (age_14 | id), alcohol1, REML = FALSE)

For instance it provides variance values instead of standard deviation values. It doesn't really matter in the long run, but it makes it easier to quickly check your work against the book. Here's the output:

> summary(model.b1)
Linear mixed model fit by maximum likelihood 
Formula: alcuse ~ age_14 + (age_14 | id) 
   Data: alcohol1 
   AIC   BIC logLik deviance REMLdev
 648.6 669.6 -318.3    636.6   643.2
Random effects:
 Groups   Name        Variance Std.Dev. Corr   
 id       (Intercept) 0.62436  0.79017         
          age_14      0.15120  0.38885  -0.223 
 Residual             0.33729  0.58077         
Number of obs: 246, groups: id, 82

Fixed effects:
            Estimate Std. Error t value
(Intercept)  0.65130    0.10508   6.198
age_14       0.27065    0.06246   4.334

Correlation of Fixed Effects:
age_14 -0.441

Again the main section to review is the "Random effects". The Residual variance (within-person) has decreased to 0.337 from 0.562. We can conclude that \( (0.562 - 0.337)/0.562 = 0.40 \) (i.e., 40%) of the within-person variation in the response is systematically associated with linear time. We also see the negative correlation (-0.223) between the true initial status (intercept) and true change (slope). However, the book notes this correlation is not statistically significant. As you can see this is not something the output of the lmer object reports. The book mentions in chapter 3 (p. 73) that statisticians disagree about the effectiveness of such significance tests on variance components, and I can only assume the authors of the lme4 package question their use. Finally, we notice the level-2 variance components: 0.624 and 0.151. These provide a benchmark for predictors' effects as the authors proceed to build models.

The Multilevel Model for Change (Ch 3 of ALDA) – revisited

In my previous post I talked about how to replicate the example in Chapter 3 of ALDA. It was mostly about finding the dataset and modifying the R code on the UCLA web site so it works. In this post I want to talk a little more about the statistics.

Recall the example involves two groups of infants: one assigned to a program to enhance cognitive functioning and the other acting as a control. Cognitive measurements were taken at three different time periods for both groups. Did the group assigned to the program perform differently than the control group? To answer this question the authors postulate a linear model where cognitive test results are explained by time, as follows:

Y = \beta_{0} + \beta_{1}*time

But the intercept and slope coefficients in that model are modeled as follows:

\beta_{0} = \gamma_{00} + \gamma_{01}*program
\beta_{1} = \gamma_{10} + \gamma_{11}*program

So we have two levels of modeling happening at the same time. The first level concerns within-person change while the second level concerns between-person differences in change. We can consolidate the two levels into one formula, like this:

Y = \gamma_{00} + \gamma_{10}*time + \gamma_{01}*program + \gamma_{11}*time*program

So we have an intercept and three coefficients. When we fit the model to the data, we get:

Y = 107.84 - 21.13*time + 6.85*program + 5.27*program*time

All of which are significant coefficients. When program = 0, our linear model is Y = 107.84 - 21.13*time . When program = 1, our linear model is Y = 114.69 - 15.86*time . The intercept in these models is interpreted as the cognitive score at the first measurement (when the infants were 12 months old). We can see that infants in the program had a higher performance at the first measurement than those not in the program: 114.69 – 107.84 = 6.85. The slope tells us the rate of decline of cognitive performance (decline?). We see the infants in the program had a slower rate of decline over time: -15.86 versus -21.13. Or put another way: -21.13 – -15.86 = -5.27, which is the coefficient of the interaction. That is, the difference in slopes between the two groups is -5.27.

Now it’s interesting to note that the model does not appear to make use of the fact that each subject contributed three waves of data. Our dataset has 309 records. But those 309 records are in 103 groups. Those 103 groups are the 103 infants. Each infant contributed three cognitive test scores over time. The dataset has a variable, ID, to indicate those groups. But ID is not in our model. Shouldn’t we make use of that information? Well, in fact we did. Have a look at the R code:

lme(cog ~ time*program, data=ch3, random = ~ time | id, method="ML", 
    control=list(opt = "optim"))

Notice how “id” indicates the grouping structure in the “random” argument. Time is specified as the random effect and “| id” indicates it is grouped by “id” (i.e., the 103 infants). In so many words, this allows us to capture the variability in each infant’s own change trajectory. We can think of plotting the cognitive test score for one infant over time and fitting a line to those three points. There will be some error in that line. But not as much error than if we fit a line to all infants in a group over the three times. In this latter scenario we’re not accounting for the grouping of the measurements by infant. We can actually see what happens if we don’t account for this grouping by doing an Analysis of Covariance (ANCOVA).

With ANCOVA, we’re basically doing regression with continuous and categorical variables. The usual approach to ANCOVA is to think of doing a regular ANOVA analysis but blocking on a continuous variable. For example, comparing cholesterol levels (y) between a treated group and a reference group adjusted for age (x, in years). We’re interested in the treatment effect but we want to account for the effect of age.

We can naively do ANCOVA with the Chapter 3 example from ALDA as follows:

lm(cog ~ program + time + time:program, data=ch3)

Look at the results:

             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   107.841      1.773  60.822  < 2e-16 ***
program         6.855      2.363   2.901  0.00399 ** 
time          -21.133      2.747  -7.694 1.99e-13 ***
program:time    5.271      3.660   1.440  0.15087 

Now compare those to the multilevel modelling results, obtained from the call to lme() above:

                 Value Std.Error  DF   t-value p-value
(Intercept)  107.84074  2.048799 204  52.63608  0.0000
time         -21.13333  1.903664 204 -11.10140  0.0000
program        6.85466  2.730259 101   2.51063  0.0136
time:program   5.27126  2.536850 204   2.07788  0.0390

Notice the similarities? That's right, both return the same model coefficients! But compare the difference in standard errors. The most dramatic is the interaction between time and program. In the ANCOVA analysis the interaction appears to be insignificant (SE = 3.7; p = 0.15). But in the multilevel model it's significant at the 5% level (SE = 2.5; p = 0.03). We see that the ANCOVA model does not take into account the change trajectories at the individual level, and is thus not sensitive enough to detect the significant difference in rates of cognitive decline between the infants in the program and those in the control group.