Partial Residuals and the termplot function

Earlier this year I taught an Intro to R Graphics workshop. While preparing for it I came across the termplot function. I had no idea it existed. Of course now that I know about it, I come across it fairly regularly. (That's the Baader-Meinhof Phenomenon for you.)

?termplot will tell you what it does: “Plots regression terms against their predictors, optionally with standard errors and partial residuals added.” If you read on, you'll see “Nothing sensible happens for interaction terms, and they may cause errors.” Duly noted.

Let's try this out and see what it does.

First we'll fit a model using the mtcars dataset that comes with R. Below mpg = Miles/(US) gallon, drat = Rear axle ratio, qsec = ¼ mile time, disp = Displacement (cu.in.), and wt = Weight (lb/1000).

lm.cars <- lm(mpg ~ wt + disp + drat + qsec, data=mtcars)
summary(lm.cars)
## 
## Call:
## lm(formula = mpg ~ wt + disp + drat + qsec, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.143 -1.933 -0.149  0.919  5.541 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.03223    9.58829    0.94  0.35454    
## wt          -4.86768    1.20872   -4.03  0.00041 ***
## disp         0.00522    0.01105    0.47  0.64021    
## drat         1.87086    1.32462    1.41  0.16926    
## qsec         1.05248    0.34778    3.03  0.00539 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.6 on 27 degrees of freedom
## Multiple R-squared:  0.838,  Adjusted R-squared:  0.814 
## F-statistic:   35 on 4 and 27 DF,  p-value: 2.55e-10
par(mfrow=c(2,2))
termplot(lm.cars)

[Plot: termplots of wt, disp, drat, and qsec from lm.cars]

Compare the coefficients in the summary output to the slopes of the lines in the graphs. See how the coefficients match the slopes of the lines? That's because they are the slopes of the lines. For example, wt has a coefficient of about -5. Likewise the termplot for that coefficient has a negative slope that appears to be about -5. Using these plots we can see at a glance the respective contributions of the predictors. wt and qsec seem to contribute to the model due to the steepness of their slopes, but disp and drat not so much. This matches the summary output that shows wt and qsec as being significant.

Now termplot also has an argument called partial.resid that will add partial residuals if set to TRUE. Let's try that out, with color set to “purple” since the default gray can be hard to see (at least to my eyes):

par(mfrow=c(2,2))
termplot(lm.cars, partial.resid=TRUE, col.res = "purple")

[Plot: termplots with partial residuals added in purple]

How are these points generated? The x-axis is clear enough. That's just the predictor values. But the y-axis says “Partial for [variable name]”, i.e. partial residuals. What does that mean? Put as simply as possible, partial residuals are the sum of the residuals and predictor terms. Hopefully you know what residuals are. Those are just the difference between the observed and fitted response values. “Predictor terms” is a little trickier.

To understand them, it helps to know that

\[y = \bar{y} + b_{1}(x_{1} - \bar{x}_{1}) + b_{2}(x_{2} - \bar{x}_{2}) + b_{3}(x_{3} - \bar{x}_{3}) + b_{4}(x_{4} - \bar{x}_{4}) + e\]

is another way of writing

\[y = b_{0} + b_{1}x_{1} + b_{2}x_{2}+ b_{3}x_{3}+ b_{4}x_{4} + e\]

So the first equation above has \(\bar{y}\) as the intercept and centered predictors. If we center the predictors and fit a model, we'll see that's precisely what we get:

# center the variables
mtcars <- transform(mtcars, wtC = wt-mean(wt), dispC = disp-mean(disp),
                    dratC=drat-mean(drat), qsecC=qsec-mean(qsec))
# now fit model using centered predictors
lm.terms.cars <- lm(mpg ~ wtC + dispC + dratC + qsecC, data=mtcars)
coef(lm.terms.cars)[1]
## (Intercept) 
##       20.09
mean(mtcars$mpg)
## [1] 20.09

Now if we take the centered predictor values and multiply them by their respective coefficients, we get what are referred to in R as terms. For example, the terms for the first record in the data set are as follows:

# take all coefficients but the intercept...
coef(lm.terms.cars)[-1]
##       wtC     dispC     dratC     qsecC 
## -4.867682  0.005222  1.870855  1.052478
# ...and first record of data...
lm.terms.cars$model[1,-1]
##               wtC  dispC  dratC  qsecC
## Mazda RX4 -0.5972 -70.72 0.3034 -1.389
# ...and multiply:
coef(lm.terms.cars)[-1]*t(lm.terms.cars$model[1,-1])
##       Mazda RX4
## wtC      2.9072
## dispC   -0.3693
## dratC    0.5677
## qsecC   -1.4616

So for the first record, the term due to wt is 2.91, the term due to disp is -0.37, and so on. These are the predictor terms I mentioned earlier. These are what we add to the residuals to make partial residuals.

You'll be happy to know we don't have to center predictors and fit a new model to extract terms. R can do that for us on the original model object when you specify type="terms" with the predict function, like so:

# notice we're calling this on the original model object, lm.cars
pterms <- predict(lm.cars, type="terms")
pterms[1,] # compare to above; the same
##      wt    disp    drat    qsec 
##  2.9072 -0.3693  0.5677 -1.4616

To calculate the partial residuals ourselves and make our own partial residual plots, we can do this:

# add residuals to each column of pterms (i.e., the terms for each predictor)
partial.residuals <- apply(pterms,2,function(x)x+resid(lm.cars))
# create our own termplot of partial residuals
par(mfrow=c(2,2))
# the model in lm.cars includes the response in first column, so we index with i + 1
for(i in 1:4){
  plot(x=lm.cars$model[,(i+1)],y=partial.residuals[,i], col="purple", ylim=c(-10,10))
  abline(lm(partial.residuals[,i] ~ lm.cars$model[,(i+1)]), col="red")
}

[Plot: hand-made partial residual plots with fitted lines]

So that's what termplot does for us: it takes the terms for each predictor, adds the residuals to the terms to create partial residuals, and then plots the partial residuals versus their respective predictor (if you specify partial.resid=TRUE). And of course it plots a fitted line, the result of regressing the predictor's partial residuals on the predictor. Recall earlier I mentioned the slope of the line in the termplot graphs is the coefficient in the summary output. We can indeed verify that:

# coefficient for wt
coef(lm.cars)[2]
##     wt 
## -4.868
coef(lm(partial.residuals[,1] ~ mtcars$wt))[2]
## mtcars$wt 
##    -4.868

Ultimately what this whole partial residual/termplot business is doing is splitting the response value into different parts: an overall mean, a term due to wt, a term due to disp, a term due to drat, a term due to qsec, and the residual. So when we create the partial residual for wt, what we're doing is adding the wt term and the residual. This sum accounts for the part of the response not explained by the other terms.

Again let's look at the overall mean, the terms for the first observation and its residual:

# overall mean of response
mean(mtcars$mpg)
## [1] 20.09
# terms of first observation
pterms[1,]
##      wt    disp    drat    qsec 
##  2.9072 -0.3693  0.5677 -1.4616
# the residual of the first observation
resid(lm.cars)[1]
## Mazda RX4 
##   -0.7346

If we add those up, we get the observed value:

# add overall mean, terms and residual for first observation
mean(mtcars$mpg) + sum(pterms[1,]) + resid(lm.cars)[1]
## Mazda RX4 
##        21
# same as observed in data
mtcars$mpg[1]
## [1] 21

So the wt term value of 2.9072 represents the part of the response that is not explained by the other terms.
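
As a quick check, using the pterms and partial.residuals objects created above, the partial residual of wt for the first observation is just its wt term plus its residual:

# wt term plus residual for the first observation...
pterms[1, "wt"] + resid(lm.cars)[1]
# ...matches the first entry in the wt column of our partial residuals matrix
partial.residuals[1, "wt"]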

Finally, we can add another argument to termplot, smooth=panel.smooth, which will draw a lowess line through the points. This can help us assess departures from linearity.

par(mfrow=c(2,2))
termplot(lm.cars, partial.resid=TRUE, col.res = "purple", smooth=panel.smooth)

[Plot: termplots with partial residuals and lowess smooths]

Notice how the dashed line bends around the straight line for wt. Perhaps wt requires a quadratic term? Let's try that using the poly function, which creates orthogonal polynomials.

lm.cars2 <- lm(mpg ~ poly(wt,2) + disp + drat + qsec, data=mtcars)
par(mfrow=c(2,2))
termplot(lm.cars2, partial.resid=TRUE, col.res = "purple", smooth=panel.smooth)

[Plot: termplots for the model with a quadratic wt term]

We now see a better fit, with the fitted line closely matching the smooth line.
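
If we want a numeric check to go with the visual one, one option (a small supplement to the termplot workflow) is to compare the two models with anova():

# does the quadratic wt term significantly improve the fit?
anova(lm.cars, lm.cars2)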

Permutation Tests

Let's talk about permutation tests and why we might want to do them.

First think about the two-sample t-test. The null hypothesis of this test is that both samples come from the same distribution. If we assume both samples come from the same approximately normal distribution, we can use math formulas based on probability theory to calculate a test statistic. We can then calculate the probability of getting such a test statistic (or one greater) under the assumption of both samples coming from the same distribution. For example, let's draw two samples from the same normal distribution and run a t-test.

set.seed(135)
s1 <- round(rnorm(10, 100, 16))
s2 <- round(rnorm(10, 100, 16))
t.test(s1, s2, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  s1 and s2
## t = 0.8193, df = 18, p-value = 0.4233
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -7.508 17.108
## sample estimates:
## mean of x mean of y 
##     102.4      97.6

As expected the p-value is high, telling us the difference in their means is not significant. Another interpretation is that the resulting t-test statistic (0.8193) would be likely if both samples came from the same distribution.

In real life we don't know with certainty the distribution of the population from which our samples are drawn. We also don't know the variance of the distribution or whether the distribution is even symmetric. We also may be stuck with a small sample size. Fortunately the t-test is pretty robust and usually reliable even when its assumptions are wrong. However, if you have your doubts, you can try a permutation test.

In the case of our two-sample example above, the permutation test takes all possible combinations of group membership and creates a permutation distribution. In other words, if we assume both samples came from the same population, a data point in group 1 is just as likely to appear in group 2. If we determine all possible permutations, we can calculate our statistic of interest for each permutation and create a distribution. We can then assess where our original statistic falls in this distribution. If it's in the tails then we have evidence that our two samples come from two different populations. Now if your sample sizes are bigger than, say, 15, creating the permutations can get computationally expensive very fast. So this is probably not something you want to do unless you're dealing with small samples.

Here's how we can do a permutation test “by hand” in R with our fake data:

d0 <- mean(s1) - mean(s2)
alls <- c(s1, s2)  # combine into one vector
N <- length(alls)
n <- length(s1)
p <- combn(N, n)  # generate all combinations of n chosen from N
dmeans <- numeric(dim(p)[2])  # vector to store means
for (i in 1:dim(p)[2]) {
    dmeans[i] <- mean(alls[p[, i]]) - mean(alls[-p[, i]])
}
# plot histogram of all possible difference in means with lines indicating
# our original difference
hist(dmeans)
abline(v = d0, lty = 2)
abline(v = -d0, lty = 2)

[Plot: histogram of all possible differences in means, with dashed lines at the observed difference]

The combn function generates a matrix of all possible index combinations of n chosen from N. I say “index” because combn doesn't return the values themselves but rather their index position in the original vector. Thus we can use it to select all possible combinations from our combined vector, which is what we do in the for loop. In this case there are 184756 possible combinations. Once all possible differences in means are calculated we can plot a histogram of the differences and draw some lines to indicate where our original sample falls. Here I drew two lines to indicate the absolute value of the difference, the equivalent of a two-sided test. The probability (or p-value) of getting a difference more extreme than our original sample is pretty high. We can see this because there's a lot of histogram on either side of our lines. We can also calculate this value:

# two-sided p-value
signif((sum(dmeans <= -abs(d0)) + sum(dmeans >= abs(d0)))/dim(p)[2])
## [1] 0.4291

Very close to the p-value returned by the t-test.

Of course there are packages for this sort of thing. The one I know of is perm. Here's how we use its permTS function to replicate our results above:

library(perm)
permTS(s1, s2, alternative = "two.sided", method = "exact.ce", 
control = permControl(tsmethod = "abs"))
## 
##  Exact Permutation Test (complete enumeration)
## 
## data:  s1 and s2
## p-value = 0.4291
## alternative hypothesis: true mean s1 - mean s2 is  0
## sample estimates:
## mean s1 - mean s2 
##               4.8

The method= argument tells the function to do “complete enumeration”. The control= argument says how to calculate the p-value (i.e., the same way we did it “by hand”).

So that's a literal two-sample permutation test. As I mentioned, if your samples are large then this approach is not feasible as the number of permutations grows out of control. For example, two groups of size 20 results in 137,846,528,820 combinations. So we usually resort to resampling methods. This is where we repeatedly resample from our combined vector of values to get a large number of combinations. We don't generate all combinations, but enough to give us a good idea. Here's one way to do it:

dmeans <- numeric(2000)  # vector to store means
for (i in 1:2000) {
    g <- sample(N, n)
    dmeans[i] <- mean(alls[g]) - mean(alls[-g])
}
hist(dmeans)
abline(v = d0, lty = 2)
abline(v = -d0, lty = 2)

[Plot: histogram of 2000 resampled differences in means, with dashed lines at the observed difference]

signif((sum(dmeans <= -abs(d0)) + sum(dmeans >= abs(d0)))/2000)
## [1] 0.4335

So out of 1.8476 × 10^5 possible permutations we only calculated 2000 (assuming no repeats) and we still got a p-value pretty close to that of our literal permutation test. Not bad.
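
For reference, those permutation counts come straight from the binomial coefficient; a quick aside in R:

choose(20, 10)  # ways to pick 10 of our 20 combined values for group 1: 184,756
choose(40, 20)  # the same for two groups of 20: 137,846,528,820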

We can do the same using the permTS function as follows:

permTS(s1, s2, alternative = "two.sided", 
method = "exact.mc", control = permControl(nmc = 2000, 
    setSEED = FALSE, tsmethod = "abs"))
## 
##  Exact Permutation Test Estimated by Monte Carlo
## 
## data:  s1 and s2
## p-value = 0.4213
## alternative hypothesis: true mean s1 - mean s2 is  0
## sample estimates:
## mean s1 - mean s2 
##               4.8 
## 
## p-value estimated from 2000 Monte Carlo replications
## 99 percent confidence interval on p-value:
##  0.3925 0.4498

The method “exact.mc” says use Monte Carlo simulation. The nmc= argument specifies number of replications (the default is 999). The setSEED= argument says not to set a seed. I want different random replications each time I run this line of code.

To wrap up, when does it make sense to use permutation tests? When you have something to permute! I quote from the classic text, An Introduction to the Bootstrap:

Permutation methods tend to apply to only a narrow range of problems. However when they apply, as in testing F = G in a two-sample problem, they give gratifyingly exact answers without parametric assumptions.

Simulating responses from a linear model

Say you fit a model using R's lm() function. What you get in return are coefficients that represent an estimate of the linear function that gave rise to the data. The assumption is the response of the model is normally distributed with a mean equal to the linear function and a standard deviation equal to the standard deviation of the residuals. Using notation we express this as follows for a simple linear model with intercept and slope coefficients:

\[ Y_{i} \sim N(\beta_{0} + \beta_{1}x_{i},\sigma) \]

The implication of this in the world of statistical computing is that we can simulate responses given values of x. R makes this wonderfully easy with its simulate() function. Simply pass it the model object and the number of simulations you want and it returns a matrix of simulated responses. A quick example:

set.seed(1)
# generate data from the linear function y = 3 + 4.2x with random noise
x <- seq(12, 23, length = 100)
y <- 3 + 4.2 * x + rnorm(100, 0, 5)
mod <- lm(y ~ x)
summary(mod)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.700  -3.029   0.078   2.926  11.487 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.900      2.504    1.56     0.12    
## x              4.180      0.141   29.70   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.51 on 98 degrees of freedom
## Multiple R-squared:   0.9,   Adjusted R-squared:  0.899 
## F-statistic:  882 on 1 and 98 DF,  p-value: <2e-16

Regressing y on x estimates the coefficients to be about 3.9 and 4.18, pretty close to the true values of 3 and 4.2. It also estimates the standard deviation of the error term to be 4.51, not too far from the 5 we specified in rnorm() when we generated the noise. Now let's use our model to simulate responses (i.e., y values).

head(simulate(mod, 5))
##   sim_1 sim_2 sim_3 sim_4 sim_5
## 1 51.26 55.90 58.09 58.91 54.40
## 2 54.71 62.14 49.79 63.08 53.18
## 3 50.87 62.15 63.88 52.26 49.64
## 4 56.16 53.96 53.72 53.69 55.50
## 5 52.96 45.60 63.38 54.04 60.39
## 6 64.35 67.65 63.20 54.68 63.57

The simulate() function returns a data frame. In this case the data frame has 100 rows and 5 columns. (I use the head() function to only show the first six rows.) Each column represents a simulated set of responses from our linear model given our x values. In the first row we have 5 values drawn from a Normal distribution with mean \( 3.89 + 4.17x_{1} \) and a standard deviation of 4.51. In the second row we have 5 values drawn from a Normal distribution with mean \( 3.89 + 4.17x_{2} \) and a standard deviation of 4.51. And so on. We could do this “manually” as follows:

simResp <- matrix(0, 100, 5)
for (i in 1:5) {
    # draw 100 responses, one for each x value, with mean b0 + b1*x and
    # sd equal to the residual standard error
    simResp[, i] <- rnorm(100, coef(mod)[1] + coef(mod)[2] * x, summary(mod)$sigma)
}
simResp[1:6, ]
##       [,1]  [,2]  [,3]  [,4]  [,5]
## [1,] 52.52 48.93 54.80 52.14 52.90
## [2,] 61.30 47.77 56.02 53.17 46.46
## [3,] 57.37 53.98 53.25 46.90 63.04
## [4,] 57.90 64.48 49.14 54.33 63.41
## [5,] 55.30 56.91 67.99 54.80 59.03
## [6,] 52.52 48.93 54.80 52.14 52.90

But the simulate() function is much faster and easier to use.

Now 5 simulations is rather puny. A better number would be 1000. Then we can use our simulated y values to generate 1000 linear models, and thus have 1000 coefficient estimates. We can then use those estimates to create kernel density plots that allow us to visualize the amount of uncertainty in our coefficient estimates. Let's do that. Notice how I convert the output of the simulate() function to a matrix. That allows me to feed the 1000 responses to the lm() function in one fell swoop instead of coding a for loop to iterate through each column.

simResp <- data.matrix(simulate(mod, 1000))
modSim <- lm(simResp ~ x)

modSim is the usual linear model object, but with the results of 1000 models instead of 1. Here are the coefficients for the first 5 models:

coef(modSim)[1:2, 1:5]
##               sim_1 sim_2 sim_3 sim_4 sim_5
## (Intercept) -0.5932 3.949 3.861 5.683 5.038
## x            4.3836 4.166 4.199 4.062 4.113

Let's create a scatterplot of these coefficients.

plot(modSim$coefficients[1, ], modSim$coefficients[2, ], xlab = "intercept", 
    ylab = "slope", main = "Scatterplot of coefficients from simulated models")

[Plot: scatterplot of intercept and slope estimates from the simulated models]

We see that the variability of the slope estimates is pretty small (3.8 – 4.6), while the intercept estimates vary widely (from below 0 to 10). This jibes with our original model (see output above), where the slope coefficient was significant with a small standard error while the intercept was not significant and had a relatively large standard error.
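
If you'd rather see the actual ranges than eyeball the scatterplot, a quick way to get them from the modSim object is:

# ranges of the simulated intercept and slope estimates
apply(coef(modSim), 1, range)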

Next we create kernel density plots of our coefficient estimates. In addition I add a vertical line to represent the mean of the 1000 coefficient estimates and add the mean itself to the plot using the text() function. I use the difference in the range of the axes to allow me to automatically determine suitable coordinates.

d1 <- density(modSim$coefficients[2, ])
d2 <- density(modSim$coefficients[1, ])
# slope
plot(d1, main = "distribution of slope coefficients")
abline(v = mean(modSim$coefficients[2, ]))
text(x = mean(modSim$coefficients[2, ]) + diff(range(d1$x)) * 0.05, y = diff(range(d1$y)) * 
    0.05, labels = round(mean(modSim$coefficients[2, ]), 2))

[Plot: kernel density of the slope estimates]

# intercept
plot(d2, main = "distribution of intercept coefficients")
abline(v = mean(modSim$coefficients[1, ]))
text(x = mean(modSim$coefficients[1, ]) + diff(range(d2$x)) * 0.05, y = diff(range(d2$y)) * 
    0.05, labels = round(mean(modSim$coefficients[1, ]), 2))

[Plot: kernel density of the intercept estimates]

If you don't compare the scales of the x-axes in the two plots, the coefficients appear to have similar variability. Perhaps a better way to do it is to put both plots in one window and set the x-axis limits for both plots to the range of the estimated intercept coefficients.

par(mfrow = c(2, 1))
# slope
plot(d1, main = "distribution of slope coefficients", xlim = range(d2$x))
abline(v = mean(modSim$coefficients[2, ]))
text(x = mean(modSim$coefficients[2, ]) + diff(range(d2$x)) * 0.05, y = diff(range(d1$y)) * 
    0.05, labels = round(mean(modSim$coefficients[2, ]), 2))
# intercept
plot(d2, main = "distribution of intercept coefficients", xlim = range(d2$x))
abline(v = mean(modSim$coefficients[1, ]))
text(x = mean(modSim$coefficients[1, ]) + diff(range(d2$x)) * 0.05, y = diff(range(d2$y)) * 
    0.05, labels = round(mean(modSim$coefficients[1, ]), 2))

[Plot: slope and intercept densities plotted on a common x-axis scale]

par(mfrow = c(1, 1))

Now we see just how uncertain our intercept estimate is compared to the slope estimate.

Finally, why don't we go ahead and plot all 1000 fitted lines over the original data along with our original fitted line and the traditional 95% confidence band we get using our original model.

plot(x, y, col = "red", pch = 19)
abline(mod)  # original fitted line
apply(coef(modSim), 2, abline, col = "gray", lty = 3)
# predicted responses from original model
modOrig <- predict(mod, interval = "confidence")
# add 95% confidence band
lines(x, modOrig[, "lwr"], col = "blue")
lines(x, modOrig[, "upr"], col = "blue")

[Plot: original data with 1000 simulated fitted lines, the original fitted line, and the 95% confidence band]

The plotting of the 1000 lines essentially provides us a confidence band for the original fitted line. We see the blue lines are slightly closer to the fitted line than the band produced by our simulated models. That's because the blue lines represent the 95% confidence band. In other words, we would expect 95% of our 1000 estimated lines to fit within those blue lines. The gray lines we see outside of the blue lines are those 5% that don't fall within the 95% confidence band.
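
As a rough sanity check of that claim, using the modSim, modOrig, and x objects above, we can count how many of the 1000 simulated lines stay entirely inside the blue band:

# fitted values for each simulated model at our x values (100 x 1000 matrix)
fits <- cbind(1, x) %*% coef(modSim)
# TRUE where a simulated fitted value falls inside the 95% confidence band
inside <- fits >= modOrig[, "lwr"] & fits <= modOrig[, "upr"]
# proportion of simulated lines that lie within the band at every x value
mean(colMeans(inside) == 1)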

Getting started with GitHub Gist in WordPress

I’ve decided it’s time to use GitHub Gist for posting code snippets in my blog. My original motivation was better syntax highlighting. I was using WP Code Highlight, which isn’t bad. It’s easy to use and does what it promises. But it doesn’t add line numbers and the syntax highlighting doesn’t seem optimal for R. GitHub Gist on the other hand handles R code beautifully.

Now highlighted code is nice, but what’s really cool about Gist is that the code is actually stored in a repository. This means I can update the code without having to update my WordPress post. And it’s easier for other people to use should they ever want to use my code. And there are other benefits as well. I guess the only bad thing is that GitHub has my code, which would be bad if they went belly-up. I’m pretty sure that won’t happen though, because I don’t want to think about it.

So here’s how you use GitHub Gist with WordPress to host R code. First create an account on GitHub and go to GitHub Gist. Where it says “Name of file”, type the name of your file and add “.r” to the end. Next select R from the language pulldown, though Gist will probably select it for you if you typed a file name ending in “.r”. Now copy and paste your code into the editor and click “create public gist”. You’re done with GitHub.

To make this block of code appear in your WordPress post, copy the code in the “Embed this gist” field and paste into your post. (Make sure you’re in the Text editor instead of the Visual editor.) It would be really nice if that was the end of it, but it’s not. WordPress strips out the embed code and nothing happens. The workaround is to add a little plug-in called Raw HTML. As the description says, “This plugin adds a set of shortcodes that you can use to ‘protect’ specific parts of your post and prevent WP from messing with them.” Awesome, just what we need. So the final step is to surround the Gist embed code with [raw]…[/raw] shortcode. (A tip of my hat to this site for telling me about Raw HTML.)

Let’s try this out. Here’s some code that samples 30 values from a N(15,2) distribution, calculates a 95% confidence interval, and plots the interval as a line. It does this 40 times and colors the CI’s that don’t capture 15 red.
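
The actual code lives in the gist embedded below, but a rough sketch along those lines might look something like this:

# 40 samples of size 30 from N(15, 2); draw each 95% CI as a vertical line,
# in red when the interval misses the true mean of 15
mu <- 15; sigma <- 2; n <- 30; reps <- 40
plot(NULL, xlim = c(1, reps), ylim = c(13, 17), xlab = "sample", ylab = "95% CI")
abline(h = mu, lty = 2)
for (i in 1:reps) {
  s <- rnorm(n, mu, sigma)
  ci <- t.test(s)$conf.int
  miss <- ci[1] > mu | ci[2] < mu
  segments(i, ci[1], i, ci[2], col = if (miss) "red" else "black")
}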

To make that appear in my post, I added the following:

[raw]
<script src="https://gist.github.com/clayford/7842136.js"></script>
[/raw]

Exploring Unordered Contrasts in R

Contrasts in R determine how linear model coefficients of categorical variables are interpreted. The default contrast for unordered categorical variables is the Treatment contrast. This means the “first” level (aka, the baseline) is rolled into the intercept and all subsequent levels have a coefficient that represents their difference from the baseline. That’s not too hard to grasp. But what about other contrasts, namely the Helmert and Sum contrasts? What do they do? Instead of explaining them, I figured I would demonstrate each.

First, let’s make some data:


set.seed(2112)
# create 3 levels, 10 each
flevels <- factor(rep(c("A","B","C"),c(10,10,10))) 
# create some "nice" data, sorted so means at each level have good separation
vals <- sort(round(runif(30,3,15))) 
# calculate mean of each level for reference
means <- tapply(vals,flevels,mean) 
means
   A    B    C 
 6.0 10.1 12.9

Our data consist of three levels of arbitrary values. "flevels" is our categorical variable. Notice I explicitly defined it to be a factor using the factor() function. I need to do this so R knows this variable is a factor and codes it according to whatever contrast setting we decide to use.

Let's verify the default unordered contrast setting is the Treatment contrast:


options("contrasts")
$contrasts
        unordered           ordered 
"contr.treatment"      "contr.poly"

Indeed it is. This means our factor levels are coded as follows:


contrasts(flevels)
  B C
A 0 0
B 1 0
C 0 1

This is a 3 x 2 matrix. The 2 columns of the matrix tell us that our model will have 2 coefficients, one for the B level and one for the C level. Therefore the A level is the baseline. The coefficients we get in our linear model for B and C will indicate the respective differences of their means from the level A mean. The values in the rows tell us what values to plug into the model to get the means for the row labels. For example, to get the mean for A we plug in 0's for both coefficients, which leaves us with the intercept. Therefore the intercept is the mean of A. Let's see all this in action before we explore the Helmert and Sum contrasts.


m.trt <- lm(vals ~ flevels)
summary(m.trt)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   6.0000     0.3935  15.249 8.63e-15 ***
flevelsB      4.1000     0.5564   7.368 6.32e-08 ***
flevelsC      6.9000     0.5564  12.400 1.17e-12 ***

Now we can verify how the Treatment contrast works by extracting the coefficient values from the model and comparing to the means we calculated earlier:


# Intercept = mean of A
coef(m.trt)[1]
(Intercept) 
          6 
means[1]
A 
6 
# flevelsB = mean of B - mean of A
coef(m.trt)[2]
flevelsB 
     4.1 
means[2] - means[1]
  B 
4.1 
# flevelsC = mean of C - mean of A
coef(m.trt)[3]
flevelsC 
     6.9 
means[3] - means[1]
  C 
6.9 

Let's also verify that plugging in the row values of the contrast matrix returns the means of each level:


# plug in row values into model to get the means of A, B and C, respectively
means
   A    B    C 
 6.0 10.1 12.9 
# mean of A --> row 1: 0 0
coef(m.trt)[1] + 0*coef(m.trt)[2] + 0*coef(m.trt)[3]
(Intercept) 
          6 
# mean of B --> row 2: 1 0
coef(m.trt)[1] + 1*coef(m.trt)[2] + 0*coef(m.trt)[3]
(Intercept) 
       10.1 
# mean of C --> row 3: 0 1
coef(m.trt)[1] + 0*coef(m.trt)[2] + 1*coef(m.trt)[3]
(Intercept) 
       12.9 

So that's how Treatment contrasts work. Now let's look at Helmert contrasts. "The coefficients for the Helmert regressors compare each level with the average of the "preceding" ones", says Fox in his book An R and S-Plus Companion to Applied Regression. I guess that makes sense. Kind of. Eh, not really. At least not to me. I say we do as we did before: fit a model and compare the coefficients to the means and see what they mean. Before we do that we need to set the contrast to Helmert:


# set contrast to "contr.helmert"
contrasts(flevels) <- "contr.helmert"
contrasts(flevels) # take a look
  [,1] [,2]
A   -1   -1
B    1   -1
C    0    2

Interesting. Notice the column labels are no longer associated with the levels of the factor. They just say 1 and 2. However this still tells us that our model will have two coefficients. Again the row values tell us what to plug in to get the means of A, B and C, respectively. To get the mean of A, we plug in -1 and -1 to the model. This means our intercept has a different interpretation. Let's fit the linear model and investigate.


m.hel <- lm(vals ~ flevels)
summary(m.hel)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.6667     0.2272  42.553  < 2e-16 ***
flevels1      2.0500     0.2782   7.368 6.32e-08 ***
flevels2      1.6167     0.1606  10.064 1.24e-10 ***

It turns out the intercept is the mean of the means, the first coefficient is the mean of the first two levels minus the first level, and the second coefficient is the mean of all three levels minus the mean of the first two levels. Did you get that? Here, this may help:


# intercept = mean of all means
coef(m.hel)[1]
(Intercept) 
   9.666667 
mean(means)
[1] 9.666667
# flevels1 = mean of first two levels minus first level
coef(m.hel)[2]
flevels1 
    2.05 
mean(means[1:2]) - means[1]
   A 
2.05 
# flevels2 = mean of all three levels minus mean of first two levels
coef(m.hel)[3]
flevels2 
1.616667 
mean(means) - mean(means[1:2])
[1] 1.616667

Let's do that thing again where we plug in the row values of the contrast matrix to verify it returns the means of the levels:


means
   A    B    C 
 6.0 10.1 12.9 
# mean of A --> row 1: -1 -1
coef(m.hel)[1] + -1*coef(m.hel)[2] + -1*coef(m.hel)[3]
(Intercept) 
          6 
# mean of B --> row 2: 1 -1
coef(m.hel)[1] + 1*coef(m.hel)[2] + -1*coef(m.hel)[3]
(Intercept) 
       10.1 
# mean of C --> row 3: 0 2
coef(m.hel)[1] + 0*coef(m.hel)[2] + 2*coef(m.hel)[3]
(Intercept) 
       12.9 

That leaves us with the Sum contrast. Regarding models fitted with the Sum contrasts, Fox tells us that "each coefficient compares the corresponding level of the factor to the average of the other levels." I think like Helmert contrasts, this one is better demonstrated. As before we need to change the contrast setting.


# set contrast to "contr.sum"
contrasts(flevels) <- "contr.sum"
contrasts(flevels) # take a look
  [,1] [,2]
A    1    0
B    0    1
C   -1   -1

Just like the Helmert contrast we see two columns with no labels. Our model will have two coefficients that don't correspond directly to the levels of our factors. By now we know the values in the rows are what we plug into our model to get the means of our levels. To get the mean of level A, we plug in 1 and 0. Time to fit the model and investigate:


m.sum <- lm(vals ~ flevels)
summary(m.sum)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.6667     0.2272  42.553  < 2e-16 ***
flevels1     -3.6667     0.3213 -11.413 7.75e-12 ***
flevels2      0.4333     0.3213   1.349    0.189  

Like the Helmert contrasts, our intercept is mean of all means. But our two coefficients have different interpretations. The first is the difference between the mean of level 1 (A) and the mean of all means. The second coefficient is the difference between the mean of level 2 (B) and the mean of all means. Notice in the model output above that the second coefficient is not significant. In other words, the mean of level B is not significantly different from the mean of all means.


# intercept = mean of all means
coef(m.sum)[1]
(Intercept) 
   9.666667 
mean(means)
[1] 9.666667
# flevels1 = mean of level 1 (here, A) - mean of all means 
coef(m.sum)[2]
 flevels1 
-3.666667 
means[1] - mean(means)
        A 
-3.666667 
# flevels2 = mean of level 2 (here, B) - mean of all means 
coef(m.sum)[3]
 flevels2 
0.4333333 
means[2] - mean(means)
        B 
0.4333333

Finally to be complete we plug in the row values of the Sum contrast matrix to verify it returns the means of the factor levels:


means
   A    B    C 
 6.0 10.1 12.9 
# mean of A -->row 1: 1 0
coef(m.sum)[1] + 1*coef(m.sum)[2] + 0*coef(m.sum)[3]
(Intercept) 
          6 
# mean of B -->row 2: 0 1
coef(m.sum)[1] + 0*coef(m.sum)[2] + 1*coef(m.sum)[3]
(Intercept) 
       10.1 
# mean of C -->row 3: -1 -1
coef(m.sum)[1] + -1*coef(m.sum)[2] + -1*coef(m.sum)[3]
(Intercept) 
       12.9 
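
One thing worth noticing before we wrap up: the choice of contrasts changes the parameterization, not the fit. Assuming the three model objects above are still in the workspace, their fitted values agree exactly:

# same fitted values regardless of contrast coding
all.equal(fitted(m.trt), fitted(m.hel))
all.equal(fitted(m.trt), fitted(m.sum))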

And finally we wrap up this exercise by returning the contrast level of our categorical variable back to the system default:


contrasts(flevels) <- NULL

Hopefully this helps you get a better handle on what Helmert and Sum contrasts do.

Modeling Discontinuous Change (Ch 6 of ALDA)

Chapter 6 of ALDA introduces strategies for fitting models in which individual change is discontinuous. This means the linear trajectory has a shift in the elevation and/or slope. To fit a model with discontinuous change we need to “include one (or more) time-varying predictor(s) that specify whether and, if so, when each person experiences the hypothesized shift.” (p. 191)

Some of the forms discontinuous change can take include:

  • an immediate shift in elevation, but no shift in slope
  • an immediate shift in slope, but no shift in elevation
  • immediate shifts in both elevation and slope
  • shifts in elevation (or slope) that differ in magnitude by time
  • multiple shifts in elevation (or slope) during multiple epochs of time

The example in the book replicates work done by Murnane, et al. (1999). In their paper they analyzed wage data for high school dropouts and investigated whether (log) wage trajectories remained smooth functions of work experience. Their idea was that obtaining a GED might command a higher wage and thus cause a discontinuity in the linear model fit to the data. The authors partially replicate this work by fitting a taxonomy of multilevel models.

Here’s the data (courtesy of UCLA IDRE). The variables of interest are:

  • id – person ID
  • lnw – natural log of wages (the response)
  • ged – indicator (1 = attained GED; 0 otherwise)
  • exper – years in labor force to nearest day
  • postexp – years in labor force from day of GED attainment
  • hgc.9 – highest grade completed, centered on grade 9
  • black – indicator (1 = black; 0 otherwise)
  • ue.7 – unemployment rate, centered on 7%

The initial model is rather elaborate due to earlier work in the book. Using the book’s notation we state the level-1 and level-2 models as follows:

Level-1
\( Y_{ij} = \pi_{0i} + \pi_{1i}EXPER_{ij} + \pi_{2i}(UE_{ij} - 7) + \epsilon_{ij} \)
Level-2
\( \pi_{0i} = \gamma_{00} + \gamma_{01}(HGC_{i} - 9) + \xi_{0i} \)
\( \pi_{1i} = \gamma_{10} + \gamma_{12}BLACK_{i} + \xi_{1i} \)
\( \pi_{2i} = \gamma_{20} \)
where
\( \epsilon_{ij} \sim N(0, \sigma_{\epsilon}^{2}) \)
and
\( \begin{bmatrix} \xi_{0i}\\ \xi_{1i} \end{bmatrix} \sim N\left( \begin{bmatrix} 0\\ 0 \end{bmatrix}, \begin{bmatrix} \sigma_{0}^{2} & \sigma_{01}\\ \sigma_{10} & \sigma_{1}^{2} \end{bmatrix} \right) \)

What a mess. Let’s break this down. The level-1 model is the individual growth model. It posits an individual’s wages can be explained by his years in the labor force (EXPER) and the unemployment rate (UE). The level-2 model says

  • the intercept in the level-1 model varies by highest grade completed (HGC)
  • the EXPER coefficient in the level-1 model varies based on whether or not your race is black (BLACK)
  • the UE coefficient is fixed for all individuals

If we collapse the two levels into one model we get

\( Y_{ij} = \gamma_{00} + \gamma_{01}(HGC_{i} - 9) + \gamma_{10}EXPER_{ij} + \gamma_{12}EXPER_{ij} \times BLACK_{i} + \gamma_{20}(UE_{ij} - 7) + \xi_{0i} + \xi_{1i}EXPER_{ij} + \epsilon_{ij} \)

That's not exactly fun to look at either, but the last few terms reveal the random effects: \( \xi_{0i} \) is the random effect for the intercept, \( \xi_{1i} \) is the random effect for the EXPER slope parameter, and \( \epsilon_{ij} \) is the residual error.

We can fit this model in R as follows:

wages <- read.table("http://www.ats.ucla.edu/stat/r/examples/alda/data/wages_pp.txt", 
          header=T, sep=",")
library(lme4)
model.a <- lmer(lnw ~ exper + hgc.9 + exper:black + ue.7 + (exper|id), 
          wages, REML=FALSE)

The key part is the stuff in the parentheses. It says EXPER - and the intercept by default - are the random effects, and that they're grouped by ID (i.e., the individuals). This means that each individual has his own intercept and EXPER coefficient in the fitted model. Let's look at the model's fixed effects and the random effects for individual 1.

The model's fixed effects:

fixef(model.a)
(Intercept)       exper       hgc.9        ue.7 exper:black 
 1.74898840  0.04405394  0.04001100 -0.01195050 -0.01818322 

Random effects for first individual (ID = 31) in data:

ranef(model.a)$id[1,]
   (Intercept)      exper
31  -0.1833142 0.02014632

To see the final model for this individual we add his random effects to the fixed effects:

ranef(model.a)$id[1,] + fixef(model.a)
   (Intercept)      exper
31    1.565674 0.06420026

Or we can just do this:

coef(model.a)$id[1,]
   (Intercept)      exper    hgc.9       ue.7 exper:black
31    1.565674 0.06420026 0.040011 -0.0119505 -0.01818322

Notice how the intercept and EXPER coefficient are different for the individual versus the fixed effects. Now we're usually less interested in the specific random effects (in this case there are 888 of them!) and more interested in their variances (or variance components). The variance component for the intercept is 0.051. The variance component for EXPER is 0.001. Those are pretty small but not negligible.
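
If you want to look at them directly, lme4's VarCorr() function extracts the variance components from the fitted model:

# variance components (and correlation) of the random effects in model.a
VarCorr(model.a)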

Having said all that, the goal of this exercise is to build the best model with discontinuities, which is largely done by deviance statistics. So let's work through the book's example and remember that everything I explained above is the baseline model. All subsequent models will build upon it.

The first model up adds a discontinuity in the intercept by including fixed and random effect for GED:

model.b <- lmer(lnw ~ exper + hgc.9 + exper:black + ue.7 + ged + 
                (exper + ged|id), wages, REML=FALSE)
anova(model.a,model.b)

To get a feel how GED affects the model, look at the records for individual 53:

wages[wages$id == 53,1:4]
   id   lnw exper ged
19 53 1.763 0.781   0
20 53 1.538 0.943   0
21 53 3.242 0.957   1
22 53 1.596 1.037   1
23 53 1.566 1.057   1
24 53 1.882 1.110   1
25 53 1.890 1.185   1
26 53 1.660 1.777   1

Notice how GED flips from 0 to 1 over time. Model B allows an individual's wage trajectory to shift in "elevation" at the point GED changes to 1. Hence the discontinuity. Should we allow this? Calling anova(model.a,model.b) helps us decide. In the output you'll see the p-value is less than 0.001. The null here is no difference between the models, i.e., the new explanatory variable in model B (GED) has no effect. So we reject the null and determine that an individual's wage trajectory may indeed display a discontinuity in elevation upon receipt of a GED.

Our next model is model B without GED random effects:

model.c <- lmer(lnw ~ exper + hgc.9 + exper:black + ue.7 + 
                ged + (exper|id), wages, REML=FALSE)
anova(model.c,model.b)

This is our baseline model with an additional fixed effect for GED. Should we include random effects for GED? Again we test the null that there is no difference between model B and C by calling anova(model.c,model.b). Since model C is nested in model B and the p-value returned is about 0.005, we reject the null and decide to keep the GED random components.

The next two models explore a discontinuity in the slope of the wages trajectory but not the elevation. We do this by removing the GED fixed and random effects and replacing them with POSTEXP fixed and random effects. Recall that POSTEXP is years in labor force from day of GED attainment. To see how it works, look at the records for individual 4384:

wages[wages$id == 4384,3:5]
     exper ged postexp
2206 0.096   0   0.000
2207 1.039   0   0.000
2208 1.726   1   0.000
2209 3.128   1   1.402
2210 4.282   1   2.556
2211 5.724   1   3.998
2212 6.024   1   4.298

The record where GED flips to 1 is the day he obtains his GED. The next record clocks the elapsed time in the workforce since obtaining his GED: 1.402 years. POSTEXP records this explicitly. But that's what EXPER records as well: 3.128 - 1.726 = 1.402. So these two variables record the passage of time in lockstep. It's just that EXPER records from the "beginning" and POSTEXP records from the day of GED attainment. Allowing this additional variable into the model allows the slope of the wages trajectory to suddenly change when a GED is obtained. OK, enough talking. Let's see if we need it.

model.d <- lmer(lnw ~ exper + hgc.9 + exper:black + ue.7 + postexp + 
                (exper + postexp|id), wages, REML=FALSE)
anova(model.d,model.a)

The p-value returned from the anova function is about 0.01. This says model D is an improvement over model A and that the trajectory slope of wages indeed changes upon receipt of GED.

The next model is model D without random effects for POSTEXP. So whereas previously we allowed the change in slope to vary across individuals (random), now we're saying the change in slope is uniform (fixed) for all individuals. In the R code this means removing POSTEXP from the random effects portion but keeping it as a fixed effect.

model.e <- lmer(lnw ~ exper + hgc.9 + exper:black + ue.7 + postexp + 
               (exper|id), wages, REML=FALSE)
anova(model.d,model.e)

The results from this anova test conclude no difference between the models (p-value = 0.34). This means we may not need to allow for POSTEXP random effects.

But before we go with that, let's fit a model that allows for discontinuity in both the slope and elevation of the individuals' wages trajectory. In other words, let's throw both GED and POSTEXP in the model as both fixed and random effects. And then let's compare the model with previous models to determine whether or not to keep each predictor.

model.f <- lmer(lnw ~ exper + hgc.9 + exper:black + ue.7 + 
                postexp + ged + (postexp + ged + exper|id), 
                wages, REML=FALSE)

# compare models f and b to evaluate POSTEXP effect (NULL: no POSTEXP effect)
anova(model.b,model.f)

# compare models f and d to evaluate GED effect  (NULL: no GED effect)
anova(model.d,model.f)

In both anova tests, the p-values are very small, thus we reject the null in each case and retain the predictors. Therefore it appears that in the presence of the GED predictor we do actually want to retain random effects for POSTEXP. But we're not done yet. Let's fit two more models, each without the POSTEXP and GED random effects, respectively, and compare them to model F:

# MODEL G - Model F without POSTEXP variance component
model.g <- lmer(lnw ~ exper + hgc.9 + exper:black + ue.7 + 
                postexp + ged + (ged + exper|id), wages, REML=FALSE)

# MODEL H - Model F without GED variance component
model.h <- lmer(lnw ~ exper + hgc.9 + exper:black + ue.7 + 
                postexp + ged + (postexp + exper|id), wages, REML=FALSE)
deviance(model.h)

# compare models g and h to model f to see if we should 
# keep POSTEXP and GED variance components
# NULL: do not need variance components
anova(model.g,model.f)
anova(model.h,model.f)

In both anova tests we get p-values less than 0.01 and reject the null. We should indeed keep the random effects. So our final model allows for discontinuities in both elevation and slope in the individuals' trajectories. Here's a super rough way we can visualize this in R:

# visual aid of discontinuity in slope and elevation
# made up model coefficients
b0 <- 3
b1 <- 5 
b2 <- 2
b3 <- 4
# variables
ind <- c(rep(0,10),rep(1,11)) # indicator of event
time <- c(1:10,10,11:20) # time
time2 <- c(rep(0,11),1:10)  # additional time tracking after event occurs
# create response
y <- b0 + b1*ind + b2*time + b3*time2
# plot response versus time
plot(time, y, type="l")

This gives us the following line plot:
[Plot: the response jumps in elevation at time 10 and rises with a steeper slope afterward]

Notice the shift in elevation at 10 and then the change in slope after the shift in elevation.

Treating Time More Flexibly (Ch 5 of ALDA)

Through 4 chapters of Applied Longitudinal Data Analysis (ALDA), the data sets have had the following constraints:

  • Balanced – all subjects have the same number of measurements.
  • Time structured – all subjects measured at the same time.
  • Time-invariant predictors – predictors that do not change over time, such as gender or treatment group.

In chapter 5 these constraints are relaxed. We work with unbalanced datasets with variably-spaced measurements and time-varying predictors. As usual, the UCLA stats consulting site replicates the chapter’s examples in 18 different stats programs. I won’t redo their work, but I will give you my boiled-down-most-important-points that I took away from this chapter. I’ll also show a couple of examples using the lmer() function from the lme4 package.

Section 5.1 Variably Spaced Measurement Occasions
Analyzing data sets with variably spaced measurement occasions is no different than analyzing data sets with identical occasions across individuals (time structured).

Example with unstructured data set (variably spaced measurements)
Data: reading scores recorded at three different times (i.e., 3 waves of data)
Fit two unconditional growth models

reading <- read.table("http://www.ats.ucla.edu/stat/r/examples/alda/data/reading_pp.txt", 
                      header=T, sep=",")
mat2 <- reading[ ,3:4]-6.5
dimnames(mat2)[[2]] <- c("agegrp.c", "age.c")  
reading <- cbind(reading, mat2)

library(lme4)
# forcing structure on data
lmer.agegrp <- lmer(piat ~ agegrp.c + (agegrp.c | id), reading, REML = FALSE)
summary(lmer.agegrp)
# using unstructured data
lmer.age <- lmer(piat ~ age.c + (age.c | id), reading, REML = FALSE)
summary(lmer.age)

The first model treats the data as structured. Instead of using child’s precise age, we are using their age classification group (6.5, 8.5, 10.5). The second model uses the child’s precise age. Notice the second model’s lower deviance: 1803 versus 1820. “Treating the unstructured data as though it is time-structured introduces error in the analysis – error that we can reduce by using the child’s age at testing as the temporal predictor.” (p. 145)

Lesson: never force an unstructured data set to be structured.

Section 5.2 Varying Numbers of Measurement Occasions
Section 5.1 concerned varying spacing of measurements. This section concerns varying numbers of measurements, a.k.a. unbalanced data. Multilevel modeling allows analysis of data sets with varying numbers of waves of data.

All subjects can contribute to a multilevel model regardless of how many waves of data they contribute. No special procedures are needed to fit a multilevel model to unbalanced data, provided it’s not too unbalanced (i.e., too many people with too few waves with respect to the complexity of your specified model).

Potential Problems with unbalanced data

  • The iterative estimation algorithms may not converge. This affects variance components, not fixed effects. “Estimation of variance components requires that enough people have sufficient data to allow quantification of within-person residual variation.” (p. 152)
  • Exceeding boundary constraints, such as negative variance components. Your output may have an estimate of 0 to indicate this. Simplifying your model by removing random effects is usually the fix.
  • Nonconvergence. This can result from poorly specified models and insufficient data. It can also result from the scale of the outcome variable (too small, make larger) or of the temporal predictor (too brief, make longer).

5.3 Time-Varying Predictors

A time-varying predictor is a variable whose values may differ over time. Examples: hours worked per week, money earned per year, employment status. No special strategies are needed to include a time-varying predictor in a multilevel model.

Examples with time-varying predictor
Data: depression scores (cesd) for unemployed; status of employment (unemp; 0 or 1) changes over time

unemployment <- read.table("http://www.ats.ucla.edu/stat/r/examples/alda/data/unemployment_pp.txt", 
                           header=T, sep=",")

# time-varying predictor is unemp
lmer.unb <- lmer(cesd ~ months + unemp + (months | id), 
                 unemployment, REML = FALSE)
summary(lmer.unb)

# allow effect of time-varying predictor (unemp) to vary over time
lmer.unc <- lmer(cesd ~ months + unemp*months + (months | id), 
                 unemployment, REML = FALSE)
summary(lmer.unc)

# constant slope for unemp=0, changing slope for unemp=1
lmer.und <- lmer(cesd ~ unemp + unemp:months + (unemp + unemp:months | id), 
                 unemployment, REML = FALSE)
summary(lmer.und)

5.4 Recentering the Effect of Time

Recentering time can produce interpretive advantages such as an intercept that represents initial status. Time can also be recentered in such a way to produce an intercept that represents final status. This is useful when final status is of special concern. Changes in recentering produce different intercept parameters but leave slope and deviance statistics unchanged. It can also lead to an intercept being significant when it previously was not (and vice versa).
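
As a sketch of what that looks like with the unemployment data above (the choice of 14 months as a stand-in for final status is just for illustration):

# recenter months near the end of the observation window so the intercept
# represents (approximate) final status rather than initial status
lmer.final <- lmer(cesd ~ I(months - 14) + unemp + (I(months - 14) | id), 
                   unemployment, REML = FALSE)
summary(lmer.final)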

A Probability Problem in Heredity – Part 3

In my previous two posts I showed worked solutions to problems 2.5 and 11.7 in Bulmer’s Principles of Statistics, both of which involve the characteristics of self-fertilizing hybrid sweet peas. It turns out that problem 11.8 also involves this same topic, so why not work it as well for completeness. The problem asks us to assume that we were unable to find an explicit solution for the maximum likelihood equation in problem 11.7 and to solve it by using the following iterative method:

\( \theta_{1} = \theta_{0} + \frac{S(\theta_{0})}{I(\theta_{0})} \)

where \( S(\theta_{0}) \) is the value of \( \frac{d \log L}{d\theta}\) evaluated at \( \theta_{0}\) and \( I(\theta_{0})\) is the value of \( -E(\frac{d^{2}\log L}{d\theta^{2}})\) evaluated at \( \theta_{0}\).

So we begin with \( \theta_{0}\) and the iterative method returns \( \theta_{1}\). Now we run the iterative method again starting with \( \theta_{1}\) and get \( \theta_{2}\):

\( \theta_{2} = \theta_{1} + \frac{S(\theta_{1})}{I(\theta_{1})} \)

We repeat this process until we converge upon a value. This is called the Newton-Raphson method. Naturally this is something we would like to have the computer do for us.

First, recall our formulas from problem 11.7:

\( \frac{d \log L}{d\theta} = \frac{1528}{2 + \theta} - \frac{223}{1 - \theta} + \frac{381}{\theta} \)
\( \frac{d^{2}\log L}{d \theta^{2}} = -\frac{1528}{(2 + \theta)^{2}} - \frac{223}{(1 - \theta)^{2}} - \frac{381}{\theta^{2}} \)

Let’s write functions for those in R:

# maximum likelihood score
mls <- function(x) {
	1528/(2 + x) - 223/(1 - x) + 381/x
	}
# the information
inf <- function(x) {
	-1528/((2 + x)^2) - 223/((1 - x)^2) - 381/(x^2)
	}

Now we can use those functions in another function that will run the iterative method starting at a trial value:

# newton-raphson using expected information matrix
nr <- function(th) {
 prev <- th
 repeat {
   new <- prev + mls(prev)/-inf(prev)
   if(abs(prev - new)/abs(new) <0.0001)
     break
   prev <- new
  }
new
}	

This function first takes its argument and names it "prev". Then it starts a repeating loop. The first thing the loop does is calculate the new value using the iterative formula. It then checks to see if the difference between the new and previous value - divided by the new value - is less than 0.0001. If it is, the loop breaks and the "new" value is printed to the console. If not, the loop repeats. Notice that each iteration is hopefully converging on a value. As it converges, the difference between the "prev" and "new" value will get smaller and smaller. So small that dividing the difference by the "new" value (or "prev" value for that matter) will begin to approach 0.

To run this function, we simply call it from the console. Let's start with a value of \( \theta_{0} = \frac{1}{4}\), as the problem suggests:

nr(1/4)
[1] 0.7844304

There you go! We could make the function tell us a little more by outputting the iterative values and number of iterations. Here's a super quick and dirty way to do that:

# newton-raphson using expected information matrix
nr <- function(th) {
 k <- 1 # number of iterations
 v <- c() # iterative values
  prev <- th
  repeat {
    new <- prev + mls(prev)/-inf(prev)
    v[k] <- new
    if(abs(prev - new)/abs(new) <0.0001)
     break
    prev <- new
    k <- k + 1
    }
print(new) # the value we converged on
print(v) # the iterative values
print(k) # number of iterations
}

Now when we run the function we get this:

nr(1/4)
[1] 0.7844304
[1] 0.5304977 0.8557780 0.8062570 0.7863259 0.7844441 0.7844304
[1] 6

We see it took 6 iterations to converge. And with that I think I've had my fill of heredity problems for a while.

A Probability Problem in Heredity – Part 2

In my last post I solved a problem from chapter 2 of M.G. Bulmer’s Principles of Statistics. In this post I work through a problem in chapter 11 that is basically a continuation of the chapter 2 problem. If you take a look at the previous post, you’ll notice we were asked to find probability in terms of theta. I did it and that’s nice and all, but we can go further. We actually have data, so we can estimate theta. And that’s what the problem in chapter 11 asks us to do. If you’re wondering why it took 9 chapters to get from finding theta to estimating theta, that’s because the first problem involved basic probability and this one requires maximum likelihood. It’s a bit of a jump where statistical background is concerned.

The results of the last post were as follows:

             purple-flowered                red-flowered
long pollen  \( \frac{1}{4}(\theta + 2) \)  \( \frac{1}{4}(1 - \theta) \)
round pollen \( \frac{1}{4}(1 - \theta) \)  \( \frac{1}{4}\theta \)

The table provides probabilities of four possible phenotypes when hybrid sweet peas are allowed to self-fertilize. For example, the probability of a self-fertilizing sweet pea producing a purple-flowered plant with long pollen is \( \frac{1}{4}(\theta + 2) \). In this post we’ll estimate theta from our data. Recall that \( \theta = (1 - \pi)^{2} \), where \( \pi \) is the probability of the dominant and recessive genes of a characteristic switching chromosomes.

Here’s the data:

             Purple-flowered  Red-flowered
Long pollen  1528             117
Round pollen 106              381

We see from the table there are 4 exclusive possibilities when the sweet pea self-fertilizes. If we think of each possibility having its own probability of occurrence, then we can think of this data as a sample from a multinomial distribution. Since chapter 11 covers maximum likelihood estimation, the problem therefore asks us to use the multinomial likelihood function to estimate theta.

Now the maximum likelihood estimator for each probability is \( \hat{p_{i}} = \frac{x_{i}}{n} \). But we can’t use that. That’s estimating four parameters. We need to estimate one parameter, theta. So we need to go back to the multinomial maximum likelihood function and define \( p_{i}\) in terms of theta. And of course we’ll work with the log likelihood since it’s easier to work with sums than products.

\( \log L(\theta) = y_{1} \log p_{1} + y_{2} \log p_{2} + y_{3} \log p_{3} + y_{4} \log p_{4} \)

If you’re not sure how I got that, google “log likelihood multinomial distribution” for more PDF lecture notes than you can ever hope to read.

Now let’s define the probabilities in terms of one parameter, theta:

\( \log L(\theta) = y_{1} \log f_{1}(\theta) + y_{2} \log f_{2}(\theta) + y_{3} \log f_{3}(\theta) + y_{4} \log f_{4}(\theta) \)

Now take the derivative. Once we have that we can set it equal to 0 and solve for theta. The solution will be the value of theta at which the log likelihood attains its maximum:

\( \frac{d \log L(\theta)}{d\theta} = \frac{y_{1}}{f_{1}(\theta)} f’_{1}(\theta) + \frac{y_{2}}{f_{2}(\theta)} f’_{2}(\theta) + \frac{y_{3}}{f_{3}(\theta)} f’_{3}(\theta) + \frac{y_{4}}{f_{4}(\theta)} f’_{4}(\theta)\)

Time to go from the abstract to the applied with our values. The y’s are the data from our table and the functions of theta are the results from the previous problem.

\( \frac{d \log L(\theta)}{d\theta} = \frac{1528}{1/4(2 + \theta)} \frac{1}{4} – \frac{117}{1/4(1 – \theta)}\frac{1}{4} – \frac{106}{1/4(1 – \theta)}\frac{1}{4} + \frac{381}{1/4(\theta)} \frac{1}{4} \)
\( \frac{d \log L(\theta)}{d\theta} = \frac{1528}{2 + \theta} – \frac{117}{1 – \theta} – \frac{106}{1 – \theta} + \frac{381}{\theta} \)
\( \frac{d \log L(\theta)}{d\theta} = \frac{1528}{2 + \theta} – \frac{223}{1 – \theta} + \frac{381}{\theta} \)

Set equal to 0 and solve for theta. Beware, it gets messy.

\( \frac{[1528(1 – \theta)\theta] – [223(2 + \theta)\theta] + [381(2 + \theta)(1 – \theta)]}{(2 + \theta)(1- \theta)\theta} = 0\)

Yeesh. Fortunately we don’t have to wrestle with the denominator: the fraction equals 0 only when the numerator equals 0, and the denominator is positive for theta between 0 and 1. Multiplying out the numerator, we’re left with this:

\( 1528\theta – 1528\theta^{2} – 446\theta – 223\theta^{2} + 762 – 381\theta – 381\theta^{2} = 0\)

And that reduces to the following quadratic equation:

\( -2132\theta^{2} + 701\theta + 762 = 0\)

I propose we use an online calculator to solve this equation. Here’s a good one. Just plug in the coefficients and hit solve to find the roots. Our coefficients are a = -2132, b = 701, and c = 762. Since it’s a quadratic equation we get two answers:

\( x_{1} = -0.4556 \)
\( x_{2} = 0.7844 \)
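
If you'd rather stay in R instead of using the online calculator, the base function polyroot gives the same roots; it takes the coefficients in increasing order of degree:

# roots of 762 + 701*x - 2132*x^2 (coefficients in increasing order)
polyroot(c(762, 701, -2132))
# the same two roots as above, returned as complex numbers
# with (essentially) zero imaginary parts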

The answer is a probability, which must lie between 0 and 1, so we toss the negative root and behold our maximum likelihood estimate of theta: 0.7844.

Remember that \( \theta = (1 – \pi)^{2}\). If we solve for pi, we get \( \pi = 1 – \theta^{1/2}\), which works out to be 0.1143. That is, we estimate the probability of characteristic genes switching chromosomes to be about 11%. Therefore we can think of theta as the probability that neither parent experienced gene switching.
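
As a numerical sanity check on all that algebra (my addition, not part of Bulmer's problem), we can maximize the log likelihood directly in R with optimize and confirm it lands on the same value:

# observed counts: purple/long, red/long, purple/round, red/round
y <- c(1528, 117, 106, 381)

# cell probabilities as functions of theta (from the table above)
p <- function(th) c(2 + th, 1 - th, 1 - th, th)/4

# multinomial log-likelihood (dropping the constant term)
logL <- function(th) sum(y * log(p(th)))

# numerical maximization over the admissible range of theta
optimize(logL, interval = c(0.001, 0.999), maximum = TRUE)$maximum
# about 0.7844, agreeing with the root of the quadratic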

Now point estimates are just estimates. We would like to know how good the estimate is. That’s where confidence intervals come into play. Let’s calculate one for theta.

It turns out that we can estimate the variability of our maximum likelihood estimator as follows:

\( V(\hat{\theta}) = \frac{1}{I(\theta)}\), where \( I(\theta) = -E\left(\frac{d^{2}\log L}{d \theta^{2}}\right) \)

We need to determine the second derivative of our log likelihood equation, take the expected value by plugging in our maximum likelihood estimator, multiply that by -1, and then take the reciprocal. The second derivative works out to be:

\( \frac{d^{2}\log L}{d \theta^{2}} = -\frac{1528}{(2 + \theta)^{2}} -\frac{223}{(1 – \theta)^{2}} -\frac{381}{\theta^{2}} \)

The negative expected value of the second derivative is obtained by plugging in our estimate of 0.7844 and multiplying by -1. Let’s head to the R console to calculate this:

th <- 0.7844 # our ML estimate
(I <- -1 * (-1528/((2+th)^2) - 223/((1-th)^2) - 381/(th^2))) # information
[1] 5613.731

Now take the reciprocal and we have our variance:

(v.th <- 1/I)
[1] 0.0001781347

We can take the square root of the variance to get the standard error and multiply by 1.96 to get the margin of error for our estimate. Then we add and subtract the margin of error from our estimate to get a 95% confidence interval:

# confidence limits on theta
0.784+(1.96*sqrt(v.th)) # upper bound
[1] 0.8101596
0.784-(1.96*sqrt(v.th)) # lower bound
[1] 0.7578404

Finally we convert the confidence interval for theta to a confidence interval for pi:

# probability of switching over
th.ub <- 0.784+(1.96*sqrt(v.th))
th.lb <- 0.784-(1.96*sqrt(v.th))
1-sqrt(th.ub) # lower bound
[1] 0.09991136
1-sqrt(th.lb) # upper bound
[1] 0.1294597

So the probability of genes switching chromosomes is most likely somewhere in the range of 10% to 13%.

A Probability Problem in Heredity

Here’s a fun problem in heredity from the Dover classic Principles of Statistics by M.G. Bulmer. It’s from chapter 2, problem 2.5.

The results of Table 7 on p. 25 (see below) can be explained on the assumption that the genes for flower colour and pollen shape are on the same chromosome but that there is a probability π that one of the genes will be exchanged for the corresponding gene on the other chromosome. If we denote the genes for purple or red flowers by P and p, and the genes for long and round pollen by L and l, then the hybrids from the cross considered will all be of the genotype PL/pl, the notation indicating that the P and L genes are on one chromosome and the p and l genes on the other. When these hybrids are allowed to self-fertilise, there is a chance π that the L and l genes will interchange in one parent, giving Pl/pL; there are therefore really three mating types, PL/pl X PL/pl, Pl/pL X PL/pl and Pl/pL x Pl/pL, which occur with probabilities \( (1 – \pi)^{2} \), \( 2\pi(1 – \pi) \) and \(\pi^{2}\) respectively. Find the probabilities of the four possible phenotypes resulting from the experiment in terms of \( \theta = (1 – \pi)^{2} \).

Here’s the table the problem refers to:

Purple-flowered Red-flowered Total
Long pollen 1528 117 1645
Round pollen 106 381 487
Total 1634 498 2132

The interesting thing about that table is that it violates Mendel’s law of independent assortment. If it obeyed that law, then the probabilities of the resulting phenotypes would be:

Purple-flowered Red-flowered
Long pollen \( \frac{9}{16} \) \( \frac{3}{16} \)
Round pollen \( \frac{3}{16} \) \( \frac{1}{16} \)

We’d expect the purple/long pollen flower to happen about 9/16 = 56% of the time. Instead we see it occurring about 1528/2132 = 72% of the time. As the problem explains, this has to do with the way genes are carried on chromosomes. This means we can’t calculate the probabilities the way we normally would for a dihybrid cross. Instead the problem asks us to calculate them conditional on whether or not the gene for pollen switched chromosomes. Our answer will take the form above, but instead of actual numbers, we’ll express our answer in terms of \( \theta = (1 – \pi)^{2} \). In other words, theta is the probability that the gene for pollen did not switch chromosomes in either parent.
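
To see just how badly the data departs from 9:3:3:1 (not something the problem asks for, just a quick aside), a chi-squared goodness-of-fit test in R makes the point:

# observed counts: purple/long, red/long, purple/round, red/round
obs <- c(1528, 117, 106, 381)

# test against Mendel's 9:3:3:1 ratios; the p-value is essentially zero,
# so independent assortment is soundly rejected for these data
chisq.test(obs, p = c(9, 3, 3, 1)/16)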

The hard part of this problem is setting it up. First, we have to recognize that each parent passes on one of its two chromosomes, and we don’t know which. In the PL/pl X PL/pl mating type, the offspring could inherit PL from both parents, in which case the result is PL, a purple/long pollen flower. Or it could inherit PL from one parent and pl from the other, which is still PL, a purple/long pollen flower, since purple and long pollen are dominant characteristics. Only if it inherits pl from both parents is the result pl, a red/round pollen flower. All told, for the PL/pl X PL/pl mating type we have the following possibilities:

PL/pl X PL/pl mating table (p = \( (1 – \pi)^{2}\) )
PL pl
PL PL PL
pl PL pl

For this mating type we get a purple/long pollen flower (PL) three out of four times and a red/round pollen flower (pl) one out of four times. We need to construct similar tables for the other two mating types. But notice the Pl/pL X PL/pl mating type can actually happen in reverse order as PL/pl X Pl/pL, so it needs two tables. Therefore we really need three more tables:

Pl/pL X PL/pl mating table (p = \( \pi(1 – \pi)\) )
PL pl
Pl PL Pl
pL PL pL
PL/pl X Pl/pL mating table (p = \( \pi(1 – \pi)\) )
Pl pL
PL PL PL
pl Pl pL
Pl/pL X Pl/pL mating table (p = \( \pi^{2}\) )
Pl pL
Pl Pl PL
pL PL pL

Have to give a quick shout out to http://truben.no/latex/table/ for helping me make those tables! Anyway…so we have 4 tables displaying 16 possible results (I’ll double-check this tally with a little R after the list):

  • 9 out of 16 times you get PL, a purple/long pollen flower
  • 3 out of 16 times you get Pl, a purple/round pollen flower
  • 3 out of 16 times you get pL, a red/long pollen flower
  • 1 out of 16 times you get pl, a red/round pollen flower
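
Here's that promised double-check: a rough R sketch (my own bookkeeping, not part of Bulmer's solution) that enumerates the 16 offspring from the four mating tables and tallies the phenotypes, with the dominant P and L winning out:

# phenotype of an offspring given the two gametes it inherits
phenotype <- function(g1, g2) {
  flower <- ifelse(grepl("P", g1) | grepl("P", g2), "P", "p")  # purple dominant
  pollen <- ifelse(grepl("L", g1) | grepl("L", g2), "L", "l")  # long dominant
  paste0(flower, pollen)
}

# gamete pairs carried by each parental genotype
PLpl <- c("PL", "pl")
PlpL <- c("Pl", "pL")

# the four mating tables above, 2 x 2 = 4 offspring each
crosses <- list(
  expand.grid(PLpl, PLpl, stringsAsFactors = FALSE),  # PL/pl x PL/pl
  expand.grid(PlpL, PLpl, stringsAsFactors = FALSE),  # Pl/pL x PL/pl
  expand.grid(PLpl, PlpL, stringsAsFactors = FALSE),  # PL/pl x Pl/pL
  expand.grid(PlpL, PlpL, stringsAsFactors = FALSE)   # Pl/pL x Pl/pL
)

offspring <- do.call(rbind, crosses)
table(phenotype(offspring$Var1, offspring$Var2))
# tallies 9 PL, 3 Pl, 3 pL, 1 pl, matching the list above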

If the results were independent, we could just call those probabilities. But they’re not. That’s the whole point of the problem. We have to take into account the probability π of the exchange of genes from one chromosome to the other. Let’s restate the probabilities for the four mating types:

  1. PL/pl X PL/pl = \( (1-\pi)^{2} \)
  2. Pl/pL X PL/pl = \( \pi(1-\pi) \)
  3. PL/pl X Pl/pL = \( \pi(1-\pi) \)
  4. Pl/pL X Pl/pL = \( \pi^{2} \)

So the probability of getting pl in the first mating type is \( \frac{1}{4}(1-\pi)^{2} \). Recall the problem asks us to find these probabilities in terms of \( \theta = (1 – \pi)^{2} \), so we can express this as \( \frac{1}{4}\theta \). And there’s one of our answers since pl does not occur in any of the other mating types.

Now we just need to find the other probabilities. To make life easier, let’s go ahead and convert all the probabilities in terms of \( \theta \):

  • \( (1 – \pi)^2 = \theta \)
  • \( (1 – \pi) = \theta^{1/2} \)
  • \( \pi = 1 – \theta^{1/2} \)
  • \( \pi^{2} = (1- \theta^{1/2})^{2} \)

The hardest one to find is PL:

\( PL = \frac{3}{4}\theta + \frac{2}{4}(1 – \theta^{1/2})\theta^{1/2} + \frac{2}{4}(1 – \theta^{1/2})\theta^{1/2} + \frac{2}{4}(1 – \theta^{1/2})^{2} \)
\( PL = \frac{3}{4}\theta + \frac{2}{4}[2(1 – \theta^{1/2})\theta^{1/2} + 1 – \theta^{1/2} – \theta^{1/2} + \theta] \)
\( PL = \frac{3}{4}\theta + \frac{2}{4}[2(\theta^{1/2} – \theta) + 1 – 2\theta^{1/2} + \theta] \)
\( PL = \frac{3}{4}\theta + \frac{2}{4}[2\theta^{1/2} – 2\theta + 1 – 2\theta^{1/2} + \theta] \)
\( PL = \frac{3}{4}\theta + \frac{2}{4}(1 – \theta) \)
\( PL = \frac{3}{4}\theta + \frac{2}{4} – \frac{2}{4}\theta \)
\( PL = \frac{1}{4}\theta + \frac{2}{4} \)
\( PL = \frac{1}{4}(\theta + 2)\)

Next up is pL:

\( pL = \frac{1}{4}(1 – \theta^{1/2})\theta^{1/2} + \frac{1}{4}(1 – \theta^{1/2})\theta^{1/2} + \frac{1}{4}(1 – \theta^{1/2})^{2} \)
\( pL = \frac{1}{4}(\theta^{1/2} – \theta + \theta^{1/2} – \theta + 1 – \theta^{1/2} – \theta^{1/2} + \theta) \)
\( pL = \frac{1}{4}(-2\theta + 1 + \theta) \)
\( pL = \frac{1}{4}(1 – \theta) \)

That just leaves Pl. But if you look at the tables above, you’ll notice Pl appears in the same tables as pL, the same number of times. So if we set up an equation to find its probability, we’ll get the same equation we had when we solved pL, which means we’ll get the same answer. So we don’t need to solve it. We already know it: \( Pl = \frac{1}{4}(1 – \theta) \)

And that finishes the problem. The probabilities of the four possible phenotypes are as follows (with a quick numeric spot check in R after the table):

purple-flowered red-flowered
long pollen \( \frac{1}{4}(\theta + 2) \) \( \frac{1}{4}(1 – \theta) \)
round pollen \( \frac{1}{4}(1 – \theta) \) \( \frac{1}{4}\theta \)
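
As a final sanity check (my own addition, not part of Bulmer's problem), here's a quick numeric spot check: pick an arbitrary value of \( \pi \), compute the four phenotype probabilities directly from the mating-type tables, and compare them with the formulas above.

# pick an arbitrary switching probability and the corresponding theta
p.switch <- 0.3
theta <- (1 - p.switch)^2

# probabilities of the four mating types
# (splitting the 2*pi*(1-pi) type into its two orderings)
w <- c(theta, p.switch * (1 - p.switch), p.switch * (1 - p.switch), p.switch^2)

# phenotype probabilities within each mating type, read off the tables above
# columns: PL, Pl, pL, pl
within <- rbind(
  c(3, 0, 0, 1)/4,  # PL/pl x PL/pl
  c(2, 1, 1, 0)/4,  # Pl/pL x PL/pl
  c(2, 1, 1, 0)/4,  # PL/pl x Pl/pL
  c(2, 1, 1, 0)/4   # Pl/pL x Pl/pL
)

direct   <- colSums(w * within)
formulas <- c((theta + 2)/4, (1 - theta)/4, (1 - theta)/4, theta/4)

all.equal(as.numeric(direct), formulas)  # TRUE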