# Most useful tests for an ANCOVA model

In his book Regression Modeling Strategies, 2nd Ed, Frank Harrell provides a list of what he calls the “most useful tests” for a 2-level factor $$\times$$ numeric model (Table 2.2, p. 19). This is often called an Analysis of Covariance, or ANCOVA. The basic idea is we have a numeric response with substantial variability and we seek to understand the variability by modeling the mean of the response as a function of a categorical variable and a numeric variable.

Let’s simulate some data for such a model and then see how we can use R to carry out these tests.

n <- 400
set.seed(1)
sex <- factor(sample(x = c("f", "m"), size = n, replace = TRUE))
age <- round(runif(n = n, min = 18, max = 65))
y <- 1 + 0.8*age + 0.4*(sex == "m") - 0.7*age*(sex == "m") + rnorm(n, mean = 0, sd = 8)
dat <- data.frame(y, age, sex)


The data contain a numeric response, y, that is a function of age and sex. I set the “true” coefficient values to 1, 0.8, 0.4, and -0.7. They correspond to $$\beta_0$$ through $$\beta_3$$ in the following model:

$y = \beta_0 + \beta_1 age + \beta_2 sex + \beta_3 age \times sex$

In addition the error component is a Normal distribution with a standard deviation of 8.

Now let’s model the data and see how close we get to recovering the true parameter values.

mod <- lm(y ~ age * sex, dat)
summary(mod)

##
## Call:
## lm(formula = y ~ age * sex, data = dat)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -23.8986  -5.8552  -0.2503   6.0507  30.6188
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.27268    1.93776   0.141    0.888
## age          0.79781    0.04316  18.484   <2e-16 ***
## sexm         2.07143    2.84931   0.727    0.468
## age:sexm    -0.72702    0.06462 -11.251   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.661 on 396 degrees of freedom
## Multiple R-squared:  0.7874, Adjusted R-squared:  0.7858
## F-statistic:   489 on 3 and 396 DF,  p-value: < 2.2e-16


While the coefficient estimates for age and the age $$\times$$ sex interaction are pretty close to the true values, the same cannot be said for the intercept and sex coefficients. The residual standard error of 8.661 is close to the true value of 8.

We can see in the summary output of the model that four hypothesis tests, one for each coefficient, are carried out for us. Each one tests whether the coefficient is equal to 0. Of those four, only one qualifies as one of the most useful tests: the last one, for age:sexm. This tests if the effect of age is independent of sex and vice versa. Stated two other ways, it tests if age and sex are additive, or if the age effect is the same for both sexes. To get a better understanding of what we’re testing, let’s plot the data with fitted age slopes for each sex.

library(ggplot2)
ggplot(dat, aes(x = age, y = y, color = sex)) +
geom_point() +
geom_smooth(method="lm")


Visually it appears the effect of age is not independent of sex. It seems more pronounced for females. Is this effect real or maybe due to chance? The hypothesis test in the summary output for age:sexm evaluates this. The effect certainly seems real: we would be unlikely to see a difference in slopes this large if there truly were no difference. It does appear the effect of age is different for each sex. The coefficient of about -0.73 estimates the difference in slopes (or age effect) between males and females.

The other three hypothesis tests are not very useful.

• Testing if the Intercept is 0 is testing whether y is 0 for females at age 0.
• Testing if age is 0 is testing whether age is associated with y for females (the reference level of sex).
• Testing if sexm is 0 is testing whether sex is associated with y for subjects at age 0.

Other more useful tests, as Harrell outlines in Table 2.2, are as follows:

• Is age associated with y?
• Is sex associated with y?
• Are either age or sex associated with y?

The last one is answered in the model output. That’s the F-statistic in the last line. It tests whether all coefficients (except the intercept) are equal to 0. The result of this test is conclusive. At least one of the coefficients is not 0.
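As a cross-check, the overall F test can also be framed as a comparison of two nested models: an intercept-only model against the full model. The sketch below uses only base R and repeats the simulation so it runs on its own; `mod0` and `overall` are names made up for this illustration.

```r
# Re-create the simulated data and model so this block is self-contained
n <- 400
set.seed(1)
sex <- factor(sample(x = c("f", "m"), size = n, replace = TRUE))
age <- round(runif(n = n, min = 18, max = 65))
y <- 1 + 0.8*age + 0.4*(sex == "m") - 0.7*age*(sex == "m") + rnorm(n, mean = 0, sd = 8)
dat <- data.frame(y, age, sex)
mod <- lm(y ~ age * sex, dat)

# Intercept-only model: age, sexm and age:sexm are all set to 0
mod0 <- lm(y ~ 1, dat)

# Nested-model F test of the intercept-only model against the full model
overall <- anova(mod0, mod)
overall

# The F statistic and degrees of freedom stored in the model summary
summary(mod)$fstatistic
```

The F value in the second row of the anova table should agree with the F-statistic in the last line of the model summary, on 3 and 396 degrees of freedom.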

To test if age is associated with y, we need to test if both the age and age:sexm coefficients are equal to 0. The car package by John Fox provides a nice function for this purpose called linearHypothesis. It takes at least two arguments. The first is the fitted model object and the second is a character vector of hypotheses. Below we specify we want to test if “age = 0” and “age:sexm = 0”.

library(car)
linearHypothesis(mod, c("age = 0", "age:sexm = 0"))

## Linear hypothesis test
##
## Hypothesis:
## age = 0
## age:sexm = 0
##
## Model 1: restricted model
## Model 2: y ~ age * sex
##
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)
## 1    398 55494
## 2    396 29704  2     25790 171.91 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


The result is once again conclusive. The p-value is virtually 0. It does indeed appear that age is associated with y.

Likewise, to test if sex is associated with y, we need to test if both the sexm and age:sexm coefficients are equal to 0.

linearHypothesis(mod, c("sexm = 0", "age:sexm = 0"))

## Linear hypothesis test
##
## Hypothesis:
## sexm = 0
## age:sexm = 0
##
## Model 1: restricted model
## Model 2: y ~ age * sex
##
##   Res.Df    RSS Df Sum of Sq     F    Pr(>F)
## 1    398 119354
## 2    396  29704  2     89651 597.6 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


As expected this test confirms that sex is associated with y, just as we specified when we simulated the data.
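Both of these tests can also be reproduced without car by fitting the restricted model explicitly and comparing it with the full model using anova. This is a minimal sketch, not the method used in the post: it repeats the simulation so it runs on its own, and `test_age` and `test_sex` are names made up for this illustration.

```r
# Re-create the simulated data and model so this block is self-contained
n <- 400
set.seed(1)
sex <- factor(sample(x = c("f", "m"), size = n, replace = TRUE))
age <- round(runif(n = n, min = 18, max = 65))
y <- 1 + 0.8*age + 0.4*(sex == "m") - 0.7*age*(sex == "m") + rnorm(n, mean = 0, sd = 8)
dat <- data.frame(y, age, sex)
mod <- lm(y ~ age * sex, dat)

# "age = 0" and "age:sexm = 0" leave a model with sex only
test_age <- anova(lm(y ~ sex, dat), mod)

# "sexm = 0" and "age:sexm = 0" leave a model with age only
test_sex <- anova(lm(y ~ age, dat), mod)

test_age
test_sex
```

Each table has the same layout as the linearHypothesis output: the restricted model in row 1, the full model in row 2, and a 2 degree of freedom F test comparing them.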

Now that we have established that age is associated with y, and that the association differs for each sex, what exactly is that association for each sex? In other words what are the slopes of the lines in our plot above?

We can sort of answer that with the model coefficients.

round(coef(mod),3)

## (Intercept)         age        sexm    age:sexm
##       0.273       0.798       2.071      -0.727


That corresponds to the following model:

$y = 0.273 + 0.798 age + 2.071 sex - 0.727 age \times sex$

When sex is female, the fitted model is

$y = 0.273 + 0.798 age$

This says the slope of the age is about 0.8 when sex is female.

When sex is male, the fitted model is

$y = (0.273 + 2.071) + (0.798 - 0.727) age$
$y = 2.344 + 0.071 age$

This says the slope of the age is about 0.07 when sex is male.

How certain are we about these estimates? That’s what standard error is for. For the age slope estimate for females the standard error is provided in the model output for the age coefficient. It shows about 0.04. Adding and subtracting 2 $$\times$$ 0.04 to the coefficient gives us a rough 95% confidence interval. Or we could just use the confint function:

confint(mod, parm = "age")

##         2.5 %    97.5 %
## age 0.7129564 0.8826672


The standard error of the age slope estimate for males takes a little more work. Another car function useful for this is the deltaMethod function. It takes at least three arguments: the model object, the quantity we wish to estimate a standard error for (expressed as a character string), and the names of the parameters. The function then calculates the standard error using the delta method. Here’s one way to do it for our model:

deltaMethod(mod, "b1 + b3", parameterNames = paste0("b", 0:3))

##           Estimate         SE       2.5 %    97.5 %
## b1 + b3 0.07079277 0.04808754 -0.02345709 0.1650426


The standard error is similar in magnitude, but since our estimate is so small the resulting confidence interval overlaps 0. This tells us the effect of age on males is too small for our data to determine if the effect is positive or negative.
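Because b1 + b3 is a linear combination of coefficients, its standard error can also be worked out directly from the coefficient covariance matrix. The sketch below is a base R cross-check rather than a replacement for deltaMethod. It repeats the simulation so it runs on its own, and `slope_m`, `se_m`, `ci_m`, `dat_m` and `mod_m` are names made up for this illustration. The interval uses a normal quantile because the deltaMethod interval shown above matches estimate ± 1.96 × SE.

```r
# Re-create the simulated data and model so this block is self-contained
n <- 400
set.seed(1)
sex <- factor(sample(x = c("f", "m"), size = n, replace = TRUE))
age <- round(runif(n = n, min = 18, max = 65))
y <- 1 + 0.8*age + 0.4*(sex == "m") - 0.7*age*(sex == "m") + rnorm(n, mean = 0, sd = 8)
dat <- data.frame(y, age, sex)
mod <- lm(y ~ age * sex, dat)

b <- coef(mod)
V <- vcov(mod)

# Age slope for males: age coefficient plus the interaction coefficient
slope_m <- b["age"] + b["age:sexm"]

# Var(b1 + b3) = Var(b1) + Var(b3) + 2 * Cov(b1, b3)
se_m <- sqrt(V["age", "age"] + V["age:sexm", "age:sexm"] + 2 * V["age", "age:sexm"])

# Approximate 95% interval based on a normal quantile
ci_m <- slope_m + c(-1, 1) * qnorm(0.975) * se_m

# Cross-check: with "m" as the reference level, the age coefficient
# is the male slope, and its standard error is reported directly
dat_m <- transform(dat, sex = relevel(sex, ref = "m"))
mod_m <- lm(y ~ age * sex, dat_m)
coef(summary(mod_m))["age", c("Estimate", "Std. Error")]
```

Releveling the factor is often the quickest way to read off a slope and standard error for a non-reference group, though confint on the releveled model uses a t quantile and so gives a very slightly wider interval.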

Another way to get the estimated age slopes for each sex, along with standard errors and confidence intervals, is to use the margins package. We use the margins function with our model object and specify that we want to estimate the marginal effect of age at each level of sex. (“Marginal effect of age” here is another way of saying the effect of age at each level of sex.)

library(margins)
margins(mod, variables = "age", at = list(sex = c("f", "m")))

## Average marginal effects at specified values

## lm(formula = y ~ age * sex, data = dat)

##  at(sex)     age
##        f 0.79781
##        m 0.07079


This does the formula work we did above. It plugs in sex and returns the estimated slope coefficient for age. If we wrap the call in summary we get the standard errors and confidence intervals.

summary(margins(mod, variables = "age", at = list(sex = c("f", "m"))))

##  factor    sex    AME     SE       z      p   lower  upper
##     age 1.0000 0.7978 0.0432 18.4841 0.0000  0.7132 0.8824
##     age 2.0000 0.0708 0.0481  1.4722 0.1410 -0.0235 0.1650


# Revisiting some old R-code

One of my first blog posts on this site simulated rolling a die 6 times and checking whether side i came up on the ith roll at least once. (For example, rolling a 3 on my 3rd roll would be a success.) I got the idea from one of my statistics textbooks which had a nice picture of the simulation converging to the true probability of 0.665 over the course of 500 trials. I was able to recreate the simulation in R and I guess I got excited and blogged about it. I was reminded of this today when I opened that old textbook and came across the R code that I had actually written in the margin by hand. Apparently I was super proud of my ground-breaking R code and decided to preserve it for posterity in writing. smh

Over 5 years later my precious R code looks pretty juvenile. It's way more complicated than it needs to be and doesn't take advantage of R's vectorized calculations. As Venables and Ripley say in MASS, “Users coming to S from other languages are often slow to take advantage of the power of S to do vectorized calculations…This often leads to unnecessary loops.” Indeed. I'm living proof of that.

I shouldn't be too hard on myself though. I was not yet familiar with functions like replicate and cumsum. And I was more focused on recreating the plot than writing optimal R code. I went with what I knew. And R, so forgiving and flexible, accommodated my novice enthusiasm.

Here is how I would approach the problem today:

r.out <- replicate(n = 500, any(sample(1:6, size = 6, replace = T) == 1:6))
p.out <- cumsum(r.out)/seq(500)
plot(x = seq(500), y = p.out, type = "l", ylim = c(0,1),
main = "Convergence to probability as n increases", xlab = "n")
abline(h = 0.665)


On line 1, we use the sample function to “roll a die 6 times” by sampling the numbers 1 – 6, with replacement, 6 times. Then we compare the 6 results with the vector of numbers 1 – 6 using the == operator and use the any function to check if any are TRUE. Next we replicate that 500 times and store the result in r.out. This is a vector of TRUE/FALSE values which R treats numerically as 1 and 0. This means we can use cumsum to find the cumulative sum of successes. To determine the cumulative proportion of successes, we divide each cumulative sum by the trial number. The result is a vector of proportions that should start converging to 0.665. Finally we plot using base R plot and abline.

This is more efficient than my original attempt 5 years ago and better captures the spirit of the simulation. I'm sure 5 years from now if I stumble upon this post I'll have yet another more elegant way to do it. I'm already looking at it thinking, “I should have generalized this with a function, and used ggplot2 to make the graph. And I shouldn't do seq(500) twice.” In fact I know I could have avoided the replicate function by using the fact that there's a probability of $$\frac{1}{6}$$ of observing side i on the ith roll of a die. So I could have used a single rbinom call to do the simulation, like so:

r.out2 <- cumsum(rbinom(n = 500, size = 6, prob = 1/6) > 0)
p.out2 <- r.out2/seq(500)
plot(x = seq(500), y = p.out2, type = "l", ylim = c(0,1),
main = "Convergence to probability as n increases", xlab = "n")
abline(h = 0.665)


In this version instead of simulating 6 literal die rolls, we simulate the number of successes in 6 die rolls. We turn each roll of the die into a binomial event: success or failure. The rbinom function allows us to simulate binomial events where size is the number of trials (or rolls in this case) and prob is the probability of success at each trial. So rbinom(n = 1, size = 6, prob = 1/6) would return a number ranging from 0 to 6 indicating the number of successes. Think of it as flipping 6 coins, each with probability of getting heads as $$\frac{1}{6}$$, and then counting the number of heads we observed. Setting the n argument to 500 replicates it 500 times. After that it's simply a matter of logically checking which outcomes were greater than 0 and using cumsum on the resulting TRUE/FALSE vector.
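The same binomial framing also gives the exact probability that the simulation converges to, with no simulation needed. A small sketch, where `p_none` and `p_any` are names made up for this illustration:

```r
# The number of "side i on roll i" matches in 6 rolls is Binomial(6, 1/6).
# Probability of zero matches, which is the same as (5/6)^6
p_none <- dbinom(0, size = 6, prob = 1/6)

# Probability of at least one match
p_any <- 1 - p_none
round(p_any, 3)
# 0.665
```

This is where the horizontal reference line at 0.665 in the plots comes from.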

This version is way faster. I mean way faster. Compare the time each takes to run 1,000,000 trials:

system.time({
r.out <- replicate(n = 1e6, any(sample(1:6, size = 6, replace = T) == 1:6))
p.out <- cumsum(r.out)/seq(1e6)
})

##    user  system elapsed
##    5.26    0.00    5.26

system.time({
r.out2 <- cumsum(rbinom(n = 1e6, size = 6, prob = (1/6)) > 0)
p.out2 <- r.out2/seq(1e6)
})

##    user  system elapsed
##    0.06    0.00    0.06


It's not even close. Who was the dummy that wrote that first version with replicate?

But does the new faster version reflect the experimental setting better? Not really. Remember, we're demonstrating probability concepts with die rolls in the first chapter of an intro stats textbook. That's probably not the best time to break out rbinom. And the demo was for 500 trials, not 1,000,000. I had to ramp up the trials to see the speed difference. Maybe the “right” R code in this situation is not the fastest version but rather the one that's easier to understand.

# Explaining and simulating an F distribution

I remember feeling very uncomfortable with the F distribution when I started learning statistics. It comes up when you’re learning ANOVA, which is rather complicated if you’re learning it for the first time. You learn about partitioning the sum of squares, determining degrees of freedom, calculating mean square error and mean square between-groups, putting everything in a table and finally calculating something called an F statistic. Oh, and don’t forget the assumptions of constant variance and normal distributions (although ANOVA is pretty robust to those violations). There’s a lot to take in! And that’s before you calculate a p-value. And how do you do that? You use an F distribution, which when I learned it meant I had to flip to a massive table in the back of my stats book. Where did that table come from and why do you use it for ANOVA? I didn’t know, I didn’t care. It was near the end of the semester and I was on overload.

I suspect this happens to a lot of people. You work hard to understand the normal distribution, hypothesis testing, confidence intervals, etc. throughout the semester. Then you get to ANOVA sometime in late November (or April) and it just deflates you. You will yourself to understand the rules of doing it, but any sort of intuition for it escapes you. You just want to finish the class.

This post attempts to provide an intuitive explanation of what an F distribution is with respect to one-factor ANOVA. Maybe someone will read this and benefit from it.

ANOVA stands for ANalysis Of VAriance. We’re analyzing variances to determine if there is a difference in means between more than 2 groups. In one-factor ANOVA we’re using one factor to determine group membership. Maybe we’re studying a new drug and create three groups: group A on 10 mg, group B on 5 mg, and group C on placebo. At the end of the study we measure some critical lab value. We take the mean of that lab value for the three groups. We wish to know if there is a difference in the means between those three population groups (assuming each group contains a random sample from three very large populations).

The basic idea is to estimate two variances and form a ratio of those variances. If the variances are about the same, the ratio will be close to 1 and we have no evidence that the population means differ based on our sample.

One variance is simply the mean of the group variances. Say you have three groups. Take the variance of each group and then find the average of those variances. That’s called the mean square error (MSE). In symbol-free mathematical speak, for three groups we have:

MSE = (variance(group A) + variance(group B) + variance(group C)) / 3

The other variance is the variance of the sample means multiplied by the number of items in each group (assuming equal sample sizes in each group). That’s called mean square between groups (MSB). It looks something like this:

MSB = variance(group means)*(n)

The F statistic is the ratio of these two variances: F = MSB/MSE
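To make the arithmetic concrete, here is a minimal sketch that computes MSE, MSB and F by hand for three simulated groups and compares the result with R's built-in ANOVA. The seed, group size, mean and standard deviation are arbitrary choices for illustration, and `F_hand`, `F_anova` and `lab` are made-up names.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
n <- 20       # equal sample size in each group

# Three groups drawn from the same normal population
A <- rnorm(n, mean = 10, sd = 4)
B <- rnorm(n, mean = 10, sd = 4)
C <- rnorm(n, mean = 10, sd = 4)

# MSE: the mean of the group variances
mse <- (var(A) + var(B) + var(C)) / 3

# MSB: the variance of the group means times the group size
msb <- var(c(mean(A), mean(B), mean(C))) * n

F_hand <- msb / mse

# The same F statistic from a one-factor ANOVA fit
lab <- data.frame(value = c(A, B, C),
                  group = factor(rep(c("A", "B", "C"), each = n)))
F_anova <- anova(lm(value ~ group, data = lab))["group", "F value"]

c(by_hand = F_hand, anova = F_anova)
```

The two values agree because the group sizes are equal. With unequal group sizes the simple averages above would need to be replaced by weighted versions.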

Now if the groups have the same means and same variance and are normally distributed, the F statistic has an F distribution. In other words, if we were to run our experiment hundreds and hundreds of times on three groups with the same mean and variance from a normal distribution, calculate the F statistic each time, and then make a histogram of all our F statistics, our histogram would have a shape that can be modeled with an F distribution.

That’s something we can do in R with the following code:

# generate 4000 F statistics for 5 groups with equal means and variances
R <- 4000
Fstat <- numeric(R)
for (i in 1:R){
  g1 <- rnorm(20,10,4)
  g2 <- rnorm(20,10,4)
  g3 <- rnorm(20,10,4)
  g4 <- rnorm(20,10,4)
  g5 <- rnorm(20,10,4)
  mse <- (var(g1)+var(g2)+var(g3)+var(g4)+var(g5))/5
  M <- (mean(g1)+mean(g2)+mean(g3)+mean(g4)+mean(g5))/5
  msb <- ((((mean(g1)-M)^2)+((mean(g2)-M)^2)+((mean(g3)-M)^2)+((mean(g4)-M)^2)+((mean(g5)-M)^2))/4)*20
  Fstat[i] <- msb/mse
}
# plot a histogram of F statistics and superimpose F distribution (4, 95)
hist(Fstat, freq=FALSE, ylim=c(0, 0.8))
curve(df(x,4,95), add=TRUE) # 5 - 1 = 4; 100 - 5 = 95

So I have 5 normally distributed groups each with a mean of 10 and standard deviation of 4. I take a random sample of 20 from each and calculate the MSE and MSB as I outlined above. I then calculate the F statistic. And I repeat 4000 times. Finally I plot a histogram of my F statistics and superimpose a theoretical F distribution on top of it. I get the resulting figure:

This is the distribution of the F statistic when the group samples come from Normal populations with identical means and variances. The smooth curve is an F distribution with 4 and 95 degrees of freedom. The 4 is Number of Groups - 1 (or 5 - 1). The 95 is from Total Number of Observations - Number of Groups (or 100 - 5). If we examine the figure we see that we typically get an F statistic somewhere near 1. Roughly half of the area under the curve lies between 0.5 and 1.5. This is the distribution we would use to find a p-value for any experiment that involved 5 groups of 20 members each. When the population means of the groups are different, we tend to get an F statistic greater than 1. The bigger the differences, the larger the F statistic tends to be. The larger the F statistic, the more confident we are that the population means of the 5 groups are not all the same.
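Instead of flipping to a table in the back of a book, the same F distribution can be queried directly in R with pf and qf. A small sketch for the 5 groups of 20 setting, where the observed value `f_obs` is a made-up number used only for illustration:

```r
df1 <- 5 - 1     # number of groups - 1
df2 <- 100 - 5   # total observations - number of groups

# Hypothetical observed F statistic (not from any real experiment)
f_obs <- 3

# p-value: area under the F(4, 95) curve to the right of f_obs
p_value <- pf(f_obs, df1, df2, lower.tail = FALSE)

# Critical value: F statistics above this are significant at the 0.05 level
crit <- qf(0.95, df1, df2)

# Area under the curve between 0.5 and 1.5
central <- pf(1.5, df1, df2) - pf(0.5, df1, df2)

c(p_value = p_value, critical_value = crit, central_area = central)
```

The printed table in a stats book is just a grid of qf values for common significance levels and degrees of freedom.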

There is much, much more that can be said about ANOVA. I'm not even scratching the surface. But what I wanted to show here is a visual of the distribution of the F statistic. Hopefully it'll give you a better idea of what exactly an F distribution is when you're learning ANOVA for the first time.