ALDA stands for the book, Applied Longitudinal Data Analysis, by Singer and Willett. I’ve had this lying around for a while and decided to give it another go. If you know a little about regression and want to get started with data analysis using multilevel models, this is probably a good place to start. It’s light on the math and heavy on application. Well, maybe not that heavy. They don’t actually provide anything in the way of computer code when it comes time to fit a model. They just do it and produce a table of results. Fortunately, our friends at the UCLA Statistical Consulting Group have generously provided programs to reproduce the examples in the book using R, SAS, and Stata, among others. So that’s where I turned when I wanted to replicate the example running through chapter 3.

The example involves 103 African-American infants born into low-income families. When the children were 6 months old, about half were randomly assigned to an early intervention program designed to enhance cognitive functioning. The other half received no intervention and acted as a control. Cognitive performance was measured at ages 12, 18 and 24 months by way of a test. The researches wanted to examine the effects of program participation on the test results. There’s actually a lot more to this study. Here’s the abstract. But the authors scaled it back and took a portion of the data to create a friendly introductory example to multilevel modeling. And I have to say, they succeeded. It is indeed an easy-to-follow example. The chapter can be read in one sitting and gives you a sense of what multilevel modeling is all about. In this case we want to model individual change in cognitive ability over time with a straight line,

but then also model the intercept and slope of *that *line with respect to program:

So we have two levels of modeling going on. Hence, “multilevel” modeling.

Naturally I wanted to replicate the Chapter 3 results in R. That’s why I’m reading the book, so I can learn how to do this stuff. So I hit up the UCLA site to get the necessary R code. And there at the top the page, before I can even get started, is a roadblock:

Please note that the “early_int” data file (which is used in Chapter 3) is not included among the data files. This was done at the request of the researchers who contributed this data file to ensure the privacy of the participants in the study. Although the web page shows how to obtain the results with this data file, we regret that visitors do not have access to this file to be able to replicate the results for themselves.

Seriously? C’mon. Did the authors not think about this when writing the book? Couldn’t they come up with a different example that *can* be replicated? Very frustrating. So I did what any honorable statistician would do and ~~gave up, turned off the computer and went outside to enjoy nature~~ Googled the name of the dataset used in the R code on the UCLA web site (earlyint_pp.txt). The very first result takes you to this Harvard web page, where the data in question is available in SAS and Stata formats. Now the authors of ALDA are both Harvard professors. And here is the data – that I’m not supposed to have access to – available on a Harvard web page for an old statistics class taught by one of the authors of the book. I guess the researchers changed their mind? Anyway, I now had the data and could try to replicate the results. And you’ll be happy to know the data contained no identifying information. It had four variables: (1) an ID number for the infant, (2) the test score, (3) age at test, and (4) an indicator for program participation.

I downloaded the Stata dataset because I can read that into R using the foreign package:

library(foreign) early.int <- read.dta("earlyint_pp.dta")

With the data loaded, I was ready to start replicating the results. This required me to copy and paste the R code from the UCLA web site. I don’t usually go for copying-and-pasting, but in this case I was OK with it. It was mostly exploratory analysis. And it worked like a charm. That is until I got to the part where you actually fit a multilevel model to the data. Running the following R code produced an error:

library(nlme) lme(cog ~ time*program, data=early.int, random = ~ time | id, , method="ML") Error in lme.formula(cog ~ time * program, data = early.int, random = ~time | : nlminb problem, convergence error code = 1 message = iteration limit reached without convergence (10)

Hmm. Seemed to work for the UCLA person. What to do? Long story short, I needed to specify a different optimizer. The default optimizer for lme is “nlminb”. I needed to specify “optim”, like so:

model1 <- lme(cog ~ time*program, data=early.int, random = ~ time | id, method="ML", control=list(opt = "optim")) summary(model1)

That worked! Or at least it gave me some output. Upon closer inspection the random effects were slightly different, but everything else looked about the same. Why does “optim” work and not “nlminb”? I have no idea. Obviously they go about optimization in different ways. But I’m not too worried at the moment since I’m just getting my feet wet with multilevel modeling.

In researching the error above I discovered I could also fit the multilevel model using the following code:

library(lme4) mod1 <- lmer(cog ~ time*program + (time | id), early.int) summary(mod1)

The output from this is also slightly different from the book as well as the output from the lme call above using “optim”. The coefficients of the fixed effects are all the same, but the random effects are slightly different.

So what does all this mean? How do you interpret the output? What is multilevel modeling? Well, that wasn’t quite the point of this post. I wanted to document for posterity how to replicate the results of ALDA Chapter 3 using R despite the message to the contrary on the UCLA site. Besides, if you’ve read this far then you're probably also working through ALDA and don't need me to repeat what's in the book. However, if you insist, I point you to this solid PowerPoint presentation that boils Chapter 3 of ALDA to its essence. It tells you everything you need to know about modeling this example and interpreting the results.

Pingback: The Multilevel Model for Change (Ch 3 of ALDA) – revisited | Statistics you can Probably Trust

John MuschelliThanks for the data link, nice post.

In case anyone comes looking around here,

lmer from lme4 uses REML = TRUE as default, you are fitting with ML, so you may different results for both. Moreover, you may get slight differences in the SE of the beta coefficients:

> model1

> model1.lme summary(model1)

Linear mixed-effects model fit by REML

Data: early.int

AIC BIC logLik

2374.784 2404.547 -1179.392

Random effects:

Formula: ~time | id

Structure: General positive-definite, Log-Cholesky parametrization

StdDev Corr

(Intercept) 11.295632 (Intr)

time 3.650461 -0.91

Residual 8.646211

Fixed effects: cog ~ time * program

Value Std.Error DF t-value p-value

(Intercept) 107.84074 2.054203 204 52.49760 0.0000

time -21.13333 1.902278 204 -11.10949 0.0000

program 6.85466 2.737461 101 2.50402 0.0139

time:program 5.27126 2.535004 204 2.07939 0.0388

Correlation:

(Intr) time progrm

time -0.638

program -0.750 0.479

time:program 0.479 -0.750 -0.638

Standardized Within-Group Residuals:

Min Q1 Med Q3 Max

-2.30359362 -0.55634165 0.02384933 0.56399047 2.32448601

Number of Observations: 309

Number of Groups: 103

> summary(model1.lme)

Linear mixed model fit by REML [‘lmerMod’]

Formula: cog ~ time * program + (time | id)

Data: early.int

REML criterion at convergence: 2358.7

Scaled residuals:

Min 1Q Median 3Q Max

-2.32496 -0.56664 0.02557 0.56998 2.31593

Random effects:

Groups Name Variance Std.Dev. Corr

id (Intercept) 126.76 11.259

time 10.32 3.213 -1.00

Residual 75.49 8.689

Number of obs: 309, groups: id, 103

Fixed effects:

Estimate Std. Error t value

(Intercept) 107.841 2.053 52.53

time -21.133 1.893 -11.16

program 6.855 2.736 2.51

time:program 5.271 2.523 2.09

Correlation of Fixed Effects:

(Intr) time progrm

time -0.638

program -0.750 0.479

time:progrm 0.479 -0.750 -0.638

Lynette BikosAlthough a few years later, I am working through ALDA. And like you, am completely bummed that the data for Chapter 3 is not available. Unfortunately, the website you linked is no longer available. Is there any chance you would share the datset? Thanks.

Lynette BikosMe again! You can disregard the request. I found the dataset on yet another wonderful teaching lab on GitHub: https://github.com/nmldesjardins/MLM/tree/master/Lab%205_growth%20models I am so grateful for all of you who provide these demonstrations.

Rommel BunuanDo you mind sharing the data with me Lynette? I had a hard time reading it into R.

Thanks,

Rommel

Clay FordPost authorThe data on the GitHub site appears to be in Stata format (DTA). You can read that into R using the haven package. For example, to read it in and create a data frame called “dat”:

`install.packages("haven")`

library(haven)

dat <- read_dta("earlyint_pp.dta")

Just make sure you have the DTA file in your working directory. If that confuses you, then you can try using RStudio's Import dialog. Click the Import Dataset button and select "From Stata".

Rommel BunuanHi Clay,

Thank you so much! I got it.

Rommel

Amber WardHi, this is great. Have you done anything on Ch2? I am not able to apply the code from here (https://stats.idre.ucla.edu/r/examples/alda/r-applied-longitudinal-data-analysis-ch-2/) to my own dataset, as the r code isn’t explained at all. I am trying to create plots of fitted trajectories by level of potential predictor variables (e.g., males vs female). The code provided is here:

# fitting the linear model by id, males only

tolm <- tolerance.pp[male == 1 , ]

fitmlist <- by(tolm, tolm$id, function(bydata) fitted.values(lm(tolerance ~ time, data=bydata)))

fitm <- unlist(fitmlist)

#appending the average for the whole group

lm.m <- fitted( lm(tolerance ~ time, data=tolm) )

names(lm.m) <- NULL

fit.m2 <- c(fitm, lm.m[1:5])

age.m <- c(tolm$age, seq(11,15))

id.m <- c(tolm$id, rep(111, 5))

#plotting the linear fit by id, males

#id.m=111 denotes the average value for males

interaction.plot(age.m, id.m, fit.m2, ylim=c(0, 4), xlab="AGE", ylab="TOLERANCE", lwd=1)

title(main="Males")

I'm getting lost at the numbers – 111, 11, 15 etc – there's no indication of what these mean/do in this code (and thus what I should be inputting to run this code on my own dataset). Any help would be appreciated!

Clay FordPost authorHi Amber.

Yeah, the UCLA code is rather complex, though explicit, which can be good when you’re learning this stuff. There are easier ways to do recreate Fig 2.7, which I share below. First, I’ll try to explain what each line is doing.

An easier and modern way to make the plot is as follows:

Hope this helps!

ElbeeThanks for this post! Helped me though the bit of code in chapter 3 that didn’t work.