Category Archives: Hypothesis Testing

Most useful tests for an ANCOVA model

In his book Regression Modeling Strategies, 2nd Ed, Frank Harrell provides a list of what he calls the “most useful tests” for a 2-level factor \(\times\) numeric model (Table 2.2, p. 19). This is often called an Analysis of Covariance, or ANCOVA. The basic idea is we have a numeric response with substantial variability and we seek to understand the variability by modeling the mean of the response as a function of a categorical variable and a numeric variable.

Let’s simulate some data for such a model and then see how we can use R to carry out these tests.

n <- 400
set.seed(1)
sex <- factor(sample(x = c("f", "m"), size = n, replace = TRUE))
age <- round(runif(n = n, min = 18, max = 65))
y <- 1 + 0.8*age + 0.4*(sex == "m") - 0.7*age*(sex == "m") + rnorm(n, mean = 0, sd = 8)
dat <- data.frame(y, age, sex)

The data contain a numeric response, y, that is a function of age and sex. I set the “true” coefficient values to 1, 0.8, 0.4, and -0.7. They correspond to \(\beta_0\) through \(\beta_3\) in the following model:

\[y = \beta_0 + \beta_1 age + \beta_2 sex + \beta_3 age \times sex\]

In addition, the error component is drawn from a Normal distribution with mean 0 and standard deviation 8.

Now let’s model the data and see how close we get to recovering the true parameter values.

mod <- lm(y ~ age * sex, dat)
summary(mod)
## 
## Call:
## lm(formula = y ~ age * sex, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.8986  -5.8552  -0.2503   6.0507  30.6188 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.27268    1.93776   0.141    0.888    
## age          0.79781    0.04316  18.484   <2e-16 ***
## sexm         2.07143    2.84931   0.727    0.468    
## age:sexm    -0.72702    0.06462 -11.251   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.661 on 396 degrees of freedom
## Multiple R-squared:  0.7874, Adjusted R-squared:  0.7858 
## F-statistic:   489 on 3 and 396 DF,  p-value: < 2.2e-16

While the coefficient estimates for age and the age \(\times\) sex interaction are pretty close to the true values, the same cannot be said for the intercept and sex coefficients. The residual standard error of 8.661 is close to the true value of 8.
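
A quick side-by-side comparison of the true and estimated values makes this easier to see. A small sketch using the true values we set above:

# compare fitted coefficients to the true values used in the simulation
cbind(true = c(1, 0.8, 0.4, -0.7), estimated = round(coef(mod), 3))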

We can see in the summary output of the model that four hypothesis tests, one for each coefficient, are carried out for us. Each tests whether the coefficient is equal to 0. Of those four, only one qualifies as one of the most useful tests: the last one, for age:sexm. This tests whether the effect of age is independent of sex and vice versa. Stated two other ways, it tests whether age and sex are additive, or whether the age effect is the same for both sexes. To get a better understanding of what we’re testing, let’s plot the data with fitted age slopes for each sex.

library(ggplot2)
ggplot(dat, aes(x = age, y = y, color = sex)) +
  geom_point() +
  geom_smooth(method="lm")

[Figure: scatterplot of y versus age, colored by sex, with fitted regression lines for each sex]

Visually it appears the effect of age is not independent of sex. It seems more pronounced for females. Is this effect real, or might it be due to chance? The hypothesis test in the summary output for age:sexm evaluates this. Obviously the effect seems very real: we are unlikely to see a difference in slopes this large if there truly were no difference. It does appear the effect of age is different for each sex. The age:sexm estimate of about -0.73 is the estimated difference in slopes (or age effects) between males and females.

The other three hypothesis tests are not very useful.

  • Testing if the Intercept is 0 is testing whether y is 0 for females at age 0.
  • Testing if age is 0 is testing whether age is associated with y for females (the reference level of sex).
  • Testing if sexm is 0 is testing whether sex is associated with y for subjects at age 0.

Other more useful tests, as Harrell outlines in Table 2.2, are as follows:

  • Is age associated with y?
  • Is sex associated with y?
  • Are either age or sex associated with y?

The last one is answered in the model output. That’s the F-statistic in the last line. It tests whether all coefficients (except the intercept) are equal to 0. The result of this test is conclusive: at least one of the coefficients is not 0.
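
If you prefer to see it as a model comparison, the same overall F test can be reproduced by comparing the full model to an intercept-only model. A quick sketch:

# overall F test as a comparison of nested models
anova(lm(y ~ 1, data = dat), mod)

The F statistic and p-value should match the last line of the summary output above.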

To test if age is associated with y, we need to test if both the age and age:sexm coefficients are equal to 0. The car package by John Fox provides a nice function for this purpose called linearHypothesis. It takes at least two arguments. The first is the fitted model object and the second is a vector of hypothesis tests. Below we specify that we want to test “age = 0” and “age:sexm = 0”.

library(car)
linearHypothesis(mod, c("age = 0", "age:sexm = 0"))
## Linear hypothesis test
## 
## Hypothesis:
## age = 0
## age:sexm = 0
## 
## Model 1: restricted model
## Model 2: y ~ age * sex
## 
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1    398 55494                                  
## 2    396 29704  2     25790 171.91 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The result is once again conclusive. The p-value is virtually 0. It does indeed appear that age is associated with y.
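
This test can also be framed as a comparison of nested models: the “restricted model” in the output above is simply the model with all age terms dropped. A quick sketch of the equivalent comparison:

# equivalent nested-model comparison: drop age and age:sexm
anova(lm(y ~ sex, data = dat), mod)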

Likewise, to test if sex is associated with y, we need to test if both the sexm and age:sexm coefficients are equal to 0.

linearHypothesis(mod, c("sexm = 0", "age:sexm = 0"))
## Linear hypothesis test
## 
## Hypothesis:
## sexm = 0
## age:sexm = 0
## 
## Model 1: restricted model
## Model 2: y ~ age * sex
## 
##   Res.Df    RSS Df Sum of Sq     F    Pr(>F)    
## 1    398 119354                                 
## 2    396  29704  2     89651 597.6 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As expected this test confirms that sex is associated with y, just as we specified when we simulated the data.

Now that we have established that age is associated with y, and that the association differs for each sex, what exactly is that association for each sex? In other words what are the slopes of the lines in our plot above?

We can sort of answer that with the model coefficients.

round(coef(mod),3)
## (Intercept)         age        sexm    age:sexm 
##       0.273       0.798       2.071      -0.727

That corresponds to the following model:

\[y = 0.273 + 0.798 age + 2.071 sex - 0.727 age \times sex\]

When sex is female, the fitted model is

\[y = 0.273 + 0.798 age \]

This says the slope of age is about 0.8 when sex is female.

When sex is male, the fitted model is

\[y = (0.273 + 2.071) + (0.798 - 0.727) age \]
\[y = 2.344 + 0.071 age \]

This says the slope of age is about 0.07 when sex is male.

How certain are we about these estimates? That’s what standard errors are for. For the age slope estimate for females, the standard error is provided in the model output for the age coefficient: about 0.04. Adding and subtracting 2 \(\times\) 0.04 to/from the coefficient gives us a rough 95% confidence interval. Or we could just use the confint function:

confint(mod, parm = "age")
##         2.5 %    97.5 %
## age 0.7129564 0.8826672
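
For comparison, here is the rough by-hand interval using the estimate and standard error printed in the summary output above. A quick sketch:

# rough 95% CI for the age slope (females): estimate +/- 2*SE
0.79781 + c(-2, 2) * 0.04316
## [1] 0.71149 0.88413

It lands very close to the confint result.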

The standard error of the age slope estimate for males takes a little more work. Another car function useful for this is the deltaMethod function. It takes at least three arguments: the model object, the quantity we wish to estimate a standard error for (expressed as a character phrase), and the names of the parameters. The function then calculates the standard error using the delta method. Here’s one way to do it for our model:

deltaMethod(mod, "b1 + b3", parameterNames = paste0("b", 0:3)) 
##           Estimate         SE       2.5 %    97.5 %
## b1 + b3 0.07079277 0.04808754 -0.02345709 0.1650426

The standard error is similar in magnitude, but since our estimate is so small the resulting confidence interval overlaps 0. This tells us the effect of age on males is too small for our data to determine if the effect is positive or negative.
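
For a linear combination of coefficients, the delta method reduces to the familiar variance formula \(Var(b_1 + b_3) = Var(b_1) + Var(b_3) + 2Cov(b_1, b_3)\), so we could also compute this standard error directly from the model’s variance-covariance matrix. A quick sketch (not the actual deltaMethod internals):

# SE of the male age slope (age + age:sexm) from the coefficient
# variance-covariance matrix
vc <- vcov(mod)
sqrt(vc["age", "age"] + vc["age:sexm", "age:sexm"] + 2 * vc["age", "age:sexm"])

This should reproduce the SE of about 0.048 reported above.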

Another way to get the estimated age slopes for each sex, along with standard errors and confidence intervals, is to use the margins package. We use the margins function with our model object and specify that we want to estimate the marginal effect of age at each level of sex. (The “marginal effect of age” here is just the effect of age within each level of sex.)

library(margins)
margins(mod, variables = "age", at = list(sex = c("f", "m")))
## Average marginal effects at specified values
## lm(formula = y ~ age * sex, data = dat)
##  at(sex)     age
##        f 0.79781
##        m 0.07079

This does the formula work we did above. It plugs in each level of sex and returns the estimated slope coefficient for age. If we wrap the call in summary we get the standard errors and confidence intervals.

summary(margins(mod, variables = "age", at = list(sex = c("f", "m"))))
##  factor    sex    AME     SE       z      p   lower  upper
##     age 1.0000 0.7978 0.0432 18.4841 0.0000  0.7132 0.8824
##     age 2.0000 0.0708 0.0481  1.4722 0.1410 -0.0235 0.1650

Profile likelihood ratio confidence intervals

When you fit a generalized linear model (GLM) in R and call confint on the model object, you get confidence intervals for the model coefficients. But you also get an interesting message:

Waiting for profiling to be done...

What's that all about? What exactly is being profiled? Put simply, it's telling you that it's calculating a profile likelihood ratio confidence interval.

The typical way to calculate a 95% confidence interval is to multiply the standard error of an estimate by some normal quantile such as 1.96 and add/subtract that product to/from the estimate to get an interval. In the context of GLMs, we sometimes call that a Wald confidence interval.

Another way to determine an upper and lower bound of plausible values for a model coefficient is to find the minimum and maximum value of the set of all coefficients that satisfy the following:

\[-2\log\left(\frac{L(\beta_{0}, \beta_{1}|y_{1},\ldots,y_{n})}{L(\hat{\beta}_{0}, \hat{\beta}_{1}|y_{1},\ldots,y_{n})}\right) < \chi_{1,1-\alpha}^{2}\]

Inside the parentheses is a ratio of likelihoods. In the denominator is the likelihood of the model we fit. In the numerator is the likelihood of the same model but with different coefficients. (More on that in a moment.) We take the log of the ratio and multiply by -2. This gives us a likelihood ratio test (LRT) statistic. This statistic is typically used to test whether a coefficient is equal to some value, such as 0, with the null likelihood in the numerator (model without the coefficient, that is, coefficient equal to 0) and the alternative or estimated likelihood in the denominator (model with the coefficient). If the LRT statistic is less than \(\chi_{1,0.95}^{2} \approx 3.84\), we fail to reject the null: the coefficient is not statistically different from 0. That means the likelihood ratio is close to 1, so the likelihood of the model without the coefficient is almost as high as that of the model with it. On the other hand, if the ratio is small, the likelihood of the model without the coefficient is much smaller than the likelihood of the model with it. The log of a small ratio is a large negative number, so multiplying by -2 produces a large LRT statistic, one bigger than 3.84, and thus rejection of the null.

Now in the formula above, we are seeking all such coefficients in the numerator that would make it a true statement. You might say we're “profiling” many different null values and their respective LRT statistics. Do they fit the profile of a plausible coefficient value in our model? The smallest value we can get without violating the condition becomes our lower bound, and likewise with the largest value. When we're done we'll have a range of plausible values for our model coefficient that gives us some indication of the uncertainty of our estimate.

Let's load some data and fit a binomial GLM to illustrate these concepts. The following R code comes from the help page for confint.glm. This is an example from the classic Modern Applied Statistics with S. ldose is a dosing level and sex is self-explanatory. SF is number of successes and failures, where success is number of dead worms. We're interested in learning about the effects of dosing level and sex on number of worms killed. Presumably this worm is a pest of some sort.

# example from Venables and Ripley (2002, pp. 190-2.)
ldose <- rep(0:5, 2)
numdead <- c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16)
sex <- factor(rep(c("M", "F"), c(6, 6)))
SF <- cbind(numdead, numalive = 20-numdead)
budworm.lg <- glm(SF ~ sex + ldose, family = binomial)
summary(budworm.lg)
## 
## Call:
## glm(formula = SF ~ sex + ldose, family = binomial)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.10540  -0.65343  -0.02225   0.48471   1.42944  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.4732     0.4685  -7.413 1.23e-13 ***
## sexM          1.1007     0.3558   3.093  0.00198 ** 
## ldose         1.0642     0.1311   8.119 4.70e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 124.8756  on 11  degrees of freedom
## Residual deviance:   6.7571  on  9  degrees of freedom
## AIC: 42.867
## 
## Number of Fisher Scoring iterations: 4
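
Before looking at confidence intervals, note that the LRT described above can be carried out for the ldose coefficient by comparing this model to the one without ldose. A quick sketch:

# likelihood ratio test for ldose via a nested model comparison
anova(glm(SF ~ sex, family = binomial), budworm.lg, test = "LRT")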

The coefficient for ldose looks significant. Let's determine a confidence interval for the coefficient using the confint function. We call confint on our model object, budworm.lg and use the parm argument to specify that we only want to do it for ldose:

confint(budworm.lg, parm = "ldose")
## Waiting for profiling to be done...
##     2.5 %    97.5 % 
## 0.8228708 1.3390581

We get our “waiting” message even though there really was no wait. If we fit a larger model and request multiple confidence intervals, then there might actually be a waiting period of a few seconds. The lower bound is about 0.82 and the upper bound about 1.34. We might say every increase in dosing level increases the log odds of killing worms by at least 0.8. We could also exponentiate to get a CI for an odds ratio estimate:

exp(confint(budworm.lg, parm = "ldose"))
## Waiting for profiling to be done...
##    2.5 %   97.5 % 
## 2.277027 3.815448

The odds of “success” (killing worms) are at least 2.3 times higher at one dosing level versus the next lower dosing level.
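
For comparison, the Wald interval described earlier (estimate \(\pm\) 1.96 \(\times\) SE) can be obtained with confint.default, which skips the profiling:

# Wald (normal-approximation) interval for comparison
confint.default(budworm.lg, parm = "ldose")

With an estimate of 1.0642 and a standard error of 0.1311, that works out to roughly (0.81, 1.32), close to but not identical to the profile likelihood interval.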

To better understand the profile likelihood ratio confidence interval, let's do it “manually”. Recall the denominator in the formula above was the likelihood of our fitted model. We can extract that with the logLik function:

den <- logLik(budworm.lg)
den
## 'log Lik.' -18.43373 (df=3)

The numerator was the likelihood of a model with a different coefficient. Here's the likelihood of a model with a coefficient of 1.05:

num <- logLik(glm(SF ~ sex + offset(1.05*ldose), family = binomial))
num
## 'log Lik.' -18.43965 (df=2)

Notice we used the offset function. That allows us to fix the coefficient to 1.05 and not have it estimated.

Since we already extracted the log likelihoods, we need to subtract them. Remember this rule from algebra?

\[\log\frac{M}{N} = \log M - \log N\]

So we subtract the denominator from the numerator, multiply by -2, and check if it's less than 3.84, which we calculate with qchisq(p = 0.95, df = 1)

-2*(num - den)
## 'log Lik.' 0.01184421 (df=2)
-2*(num - den) < qchisq(p = 0.95, df = 1)
## [1] TRUE

It is. 1.05 seems like a plausible value for the ldose coefficient. That makes sense since the estimated value was 1.0642. Let's try it with a larger value, like 1.5:

num <- logLik(glm(SF ~ sex + offset(1.5*ldose), family = binomial))
-2*(num - den) < qchisq(p = 0.95, df = 1)
## [1] FALSE

FALSE. 1.5 seems too big to be a plausible value for the ldose coefficient.

Now that we have the general idea, we can program a while loop to check different values until we exceed our threshold of 3.84.

cf <- budworm.lg$coefficients[3]  # fitted coefficient 1.0642
cut <- qchisq(p = 0.95, df = 1)   # about 3.84
e <- 0.001                        # increment to add to coefficient
LR <- 0                           # to kick start our while loop 
while(LR < cut){
  cf <- cf + e
  num <- logLik(glm(SF ~ sex + offset(cf*ldose), family = binomial))
  LR <- -2*(num - den)
}
(upper <- cf)
##    ldose 
## 1.339214

To begin we save the original coefficient to cf, store the cutoff value to cut, define our increment of 0.001 as e, and set LR to an initial value of 0. In the loop we increment our coefficient estimate which is used in the offset function in the estimation step. There we extract the log likelihood and then calculate LR. If LR is less than cut (3.84), the loop starts again with a new coefficient that is 0.001 higher. We see that our upper bound of 1.339214 is very close to what we got above using confint (1.3390581). If we set e to smaller values we'll get closer.

We can find the LR profile lower bound in a similar way. Instead of adding the increment we subtract it:

cf <- budworm.lg$coefficients[3]  # reset cf
LR <- 0                           # reset LR 
while(LR < cut){
  cf <- cf - e
  num <- logLik(glm(SF ~ sex + offset(cf*ldose), family = binomial))
  LR <- -2*(num - den)
}
(lower <- cf)
##    ldose 
## 0.822214

The result, 0.822214, is very close to the lower bound we got from confint (0.8228708).
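
Rather than stepping by a fixed increment, we could also treat each bound as a root-finding problem and let uniroot find where the LRT statistic crosses the cutoff. A minimal sketch, assuming den and cut from the chunks above are still in the workspace:

# LRT statistic minus the chi-square cutoff, as a function of the coefficient
lrt_gap <- function(cf) {
  num <- logLik(glm(SF ~ sex + offset(cf * ldose), family = binomial))
  as.numeric(-2 * (num - den)) - cut
}
uniroot(lrt_gap, interval = c(coef(budworm.lg)["ldose"], 2))$root  # upper bound
uniroot(lrt_gap, interval = c(0, coef(budworm.lg)["ldose"]))$root  # lower bound

Both roots should land very close to the bounds reported by confint.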

This is a very basic implementation of calculating a likelihood ratio confidence interval. It is only meant to give a general sense of what's happening when you see that message Waiting for profiling to be done.... I hope you found it helpful. To see how R does it, enter getAnywhere(profile.glm) in the console and inspect the code. It's not for the faint of heart.

I have to mention the book Analysis of Categorical Data with R, from which I gained a better understanding of the material in this post. The authors have kindly shared their R code at the following web site if you want to have a look: http://www.chrisbilder.com/categorical/

To see how they “manually” calculate likelihood ratio confidence intervals, go to the following R script and see the section “Examples of how to find profile likelihood ratio intervals without confint()”: http://www.chrisbilder.com/categorical/Chapter2/Placekick.R

Testing Composite Hypotheses about Fixed Effects (Ch 4 of ALDA)

I’m still on my ALDA kick here, this time posting about section 4.7 of Chapter 4. In my last post I talked about deviance-based hypothesis tests in the context of model building. Recall you have to be aware of which method of estimation you used when you deploy the deviance statistic. Namely, if you used Restricted Maximum Likelihood (RML) to estimate model parameters, you can only use the deviance statistic to test hypotheses about variance components. This is an important point as many programs default to RML, such as the lme4 package in R and SAS PROC MIXED. But the deviance statistic is not the only tool for testing composite hypotheses.

That leads us to section 4.7 and the Wald statistic. The Wald statistic allows you to “test composite hypotheses about multiple effects regardless of the method of estimation used. This means that if you use restricted methods of estimation, which prevent you from using deviance-based tests to compare models with different fixed effects, you still have a means of testing composite hypotheses about sets of fixed effects.” (p. 122) Sounds good to me!

The authors give two examples, one of which I want to review in this post. As usual they don’t show you how to do the test using statistical software. Unfortunately, neither does the UCLA stats consulting page for ALDA. So I had to figure it out on my own.

Let’s revisit the example study motivating the work in this chapter. A total of 82 adolescents were surveyed on alcohol use. Some of the variables collected included:

  • alcuse, a rating-scale measure of alcohol use (the response)
  • age_14, age of participant centered about 14 (14 = 0, 15 = 1, 16 = 2)
  • coa, an indicator whether or not participant is a child of an alcoholic (1 = yes, 0 = no)
  • cpeer, a rating-scale measure of alcohol use among peers centered on its sample mean of 1.018
  • id, an arbitrary label used to group observations by person

These variables are part of Model F, the model of interest in section 4.7, which aims to explain the variability in alcuse. (Models A through E precede Model F in the earlier model-building portion of the chapter.) Here’s Model F in its multilevel form:

level 1
\[Y_{ij} = \pi_{0i} + \pi_{1i} age14_{ij} + \epsilon_{ij}\]

level 2
\[\pi_{0i} = \gamma_{00} + \gamma_{01} coa + \gamma_{02} cpeer + \zeta_{0i}\]
\[\pi_{1i} = \gamma_{10} + \gamma_{12} cpeer + \zeta_{1i}\]

So this model posits that individuals have a linear trajectory over time (level 1), and that the parameters of that linear trajectory differ between individuals based on coa and cpeer (level 2).

We can combine the two levels into one scary-looking composite representation of the model:
\[Y_{ij} = \gamma_{00} + \gamma_{01} coa + \gamma_{02} cpeer + \gamma_{10} age14 + \gamma_{12} cpeer \times age14 + \zeta_{0i} + \zeta_{1i} age14 + \epsilon_{ij}\]

Then we can estimate the parameters of that model in R with the following code:

alcohol1 <- read.table("http://www.ats.ucla.edu/stat/r/examples/alda/data/alcohol1_pp.txt", 
                       header=T, sep=",")
attach(alcohol1)
library(lme4)
model.f1 <- lmer(alcuse ~ coa + cpeer*age_14 + (age_14 | id), alcohol1, REML = FALSE)
summary(model.f1)

And now we are ready to test composite hypotheses about this model. The first example in the book asks whether the average child of non-alcoholic parents - with an average value of peer - drinks no alcohol at age 14 (intercept = 0) and remains abstinent over time (slope = 0). So we set coa and cpeer both equal to 0 in our composite model and we're left with this:

\[Y_{ij} = \gamma_{00} + \gamma_{10} age14 + \zeta_{0i} + \zeta_{1i} age14 + \epsilon_{ij}\]

Thus our question is essentially asking if the slope and intercept in this model are 0. Or to state it formally, our composite null hypothesis is as follows:

\[H_{0}: \gamma_{00} = 0 \: \textrm{and} \: \gamma_{10} = 0\]

Now to carry out this test, we need to express this hypothesis as a general linear hypothesis in matrix notation. First let's restate the hypothesis using our fixed effects:

\[1\gamma_{00} + 0\gamma_{01} + 0\gamma_{02} + 0\gamma_{10} + 0\gamma_{12} = 0\]
\[0\gamma_{00} + 0\gamma_{01} + 0\gamma_{02} + 1\gamma_{10} + 0\gamma_{12} = 0\]

We have weighted our coefficients so that only the two we're interested in remain. Now we create a matrix of those weights. This is easier and faster to show in R than to typeset in LaTeX, which I'm not even sure I can pull off with the WordPress plugin I'm using.

C <- matrix(c(1,0,0,0,0,0,0,0,1,0), nrow=2, byrow=TRUE)
C
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    0    0    0    0
[2,]    0    0    0    1    0

Now using the matrix we just created we can conduct a linear hypothesis test using the linearHypothesis function available in the car package, like so:

library(car)
linearHypothesis(model.f1, C)

This returns a Wald statistic of 51.03 on 2 degrees of freedom, almost matching the book which reports 51.01. The p-value is practically 0, which means we reject this composite hypothesis.

Now it's nice to know there's an R function that will calculate the Wald statistic, but what is it? How can we find out? The following code reveals the source of linearHypothesis:

print(getAnywhere(linearHypothesis.mer))

In it we see the following calculation:

SSH <- as.vector(t(L %*% b - rhs) %*% solve(L %*% V %*% t(L)) %*% 
        (L %*% b - rhs))

That's it. So we have some matrix multiplication happening. In this calculation L is our hypothesis matrix, b is our fixed effects, rhs means "right hand side" (which in our example is 0) and V is the variance-covariance matrix of the model parameters.

If we wanted to calculate the Wald statistic by hand, we could do the following:

# extract the fixed effect estimates
b <- matrix(fixef(model.f1), nrow = 5)

# create the "right-hand side"
q <- matrix(c(0,0),nrow=2)

# extract the variance-covariance matrix
V <- vcov(model.f1)

# calculate the Wald statistic
W <- t(C%*%b - q) %*% solve(C%*% V %*%t(C)) %*% (C%*%b - q)

To calculate the p-value, use the pchisq function:

pchisq(as.numeric(W),2,lower.tail=FALSE)

Explaining and simulating an F distribution

I remember feeling very uncomfortable with the F distribution when I started learning statistics. It comes up when you’re learning ANOVA, which is rather complicated if you’re learning it for the first time. You learn about partitioning the sum of squares, determining degrees of freedom, calculating mean square error and mean square between-groups, putting everything in a table and finally calculating something called an F statistic. Oh, and don’t forget the assumptions of constant variance and normal distributions (although ANOVA is pretty robust to those violations). There’s a lot to take in! And that’s before you calculate a p-value. And how do you do that? You use an F distribution, which when I learned it meant I had to flip to a massive table in the back of my stats book. Where did that table come from and why do you use it for ANOVA? I didn’t know, I didn’t care. It was near the end of the semester and I was on overload.

I suspect this happens to a lot of people. You work hard to understand the normal distribution, hypothesis testing, confidence intervals, etc. throughout the semester. Then you get to ANOVA sometime in late November (or April) and it just deflates you. You will yourself to understand the rules of doing it, but any sort of intuition for it escapes you. You just want to finish the class.

This post attempts to provide an intuitive explanation of what an F distribution is with respect to one-factor ANOVA. Maybe someone will read this and benefit from it.

ANOVA stands for ANalysis Of VAriance. We’re analyzing variances to determine if there is a difference in means between more than 2 groups. In one-factor ANOVA we’re using one factor to determine group membership. Maybe we’re studying a new drug and create three groups: group A on 10 mg, group B on 5 mg, and group C on placebo. At the end of the study we measure some critical lab value. We take the mean of that lab value for the three groups. We wish to know if there is a difference in the means between those three population groups (assuming each group contains a random sample from three very large populations).

The basic idea is to estimate two variances and form a ratio of those variances. If the variances are about the same, the ratio will be close to 1 and we have no evidence that the population means differ based on our sample.

One variance is simply the mean of the group variances. Say you have three groups. Take the variance of each group and then find the average of those variances. That’s called the mean square error (MSE). In symbol-free mathematical speak, for three groups we have:

MSE = [variance(group A) + variance(group B) + variance(group C)] / 3

The other variance is the variance of the sample means multiplied by the number of items in each group (assuming equal sample sizes in each group). That’s called mean square between groups (MSB). It looks something like this:

MSB = variance(group means)*(n)

The F statistic is the ratio of these two variances: F = MSB/MSE
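
As a quick sanity check that these are the same quantities ANOVA uses, we can compare the hand calculation to what R's anova function reports for some made-up data. A small sketch (the groups and values here are arbitrary):

# three made-up groups of 10 observations each
set.seed(2)
grp  <- rep(c("A", "B", "C"), each = 10)
vals <- rnorm(30)

mse <- mean(tapply(vals, grp, var))       # average of the group variances
msb <- var(tapply(vals, grp, mean)) * 10  # variance of the group means times n
msb / mse                                 # F statistic by hand

anova(lm(vals ~ grp))                     # same F statistic in the ANOVA table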

Now if the groups have the same means and same variance and are normally distributed, the F statistic has an F distribution. In other words, if we were to run our experiment hundreds and hundreds of times on three groups with the same mean and variance from a normal distribution, calculate the F statistic each time, and then make a histogram of all our F statistics, our histogram would have a shape that can be modeled with an F distribution.

That’s something we can do in R with the following code:

# generate 4000 F statistics for 5 groups with equal means and variances
R <- 4000
Fstat <- vector(length = R)
for (i in 1:R){
  g1 <- rnorm(20,10,4)
  g2 <- rnorm(20,10,4)
  g3 <- rnorm(20,10,4)
  g4 <- rnorm(20,10,4)
  g5 <- rnorm(20,10,4)
  mse <- (var(g1)+var(g2)+var(g3)+var(g4)+var(g5))/5
  M <- (mean(g1)+mean(g2)+mean(g3)+mean(g4)+mean(g5))/5
  msb <- ((((mean(g1)-M)^2)+((mean(g2)-M)^2)+((mean(g3)-M)^2)+((mean(g4)-M)^2)+((mean(g5)-M)^2))/4)*20
  Fstat[i] <- msb/mse
}
# plot a histogram of F statistics and superimpose F distribution (4, 95)
ylim <- (range(0, 0.8))
x <- seq(0,6,0.01)
hist(Fstat,freq=FALSE, ylim=ylim)
curve(df(x,4,95),add=T) # 5 - 1 = 4; 100 - 5 = 95

So I have 5 normally distributed groups each with a mean of 10 and standard deviation of 4. I take 20 random samples from each and calculate the MSE and MSB as I outlined above. I then calculate the F statistic. And I repeat 4000 times. Finally I plot a histogram of my F statistics and super-impose a theoretical F distribution on top of it. I get the resulting figure:

This is the distribution of the F statistic when the group samples come from Normal populations with identical means and variances. The smooth curve is an F distribution with 4 and 95 degrees of freedom. The 4 is Number of Groups - 1 (or 5 - 1). The 95 is Total Number of Observations - Number of Groups (or 100 - 5). If we examine the figure we see that we most likely get an F statistic around 1. The bulk of the area under the curve is between 0.5 and 1.5. This is the distribution we would use to find a p-value for any experiment that involved 5 groups of 20 members each. When the population means of the groups are different, we get an F statistic greater than 1. The bigger the differences, the larger the F statistic. The larger the F statistic, the more confident we are that the population means between the 5 groups are not the same.
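
To make that concrete, the p-value for an observed F statistic from such an experiment is the area under this curve to the right of that statistic, which we can get with pf. A quick sketch using a hypothetical observed value of 2.5:

# p-value for a hypothetical observed F statistic of 2.5
# with 4 and 95 degrees of freedom
pf(2.5, df1 = 4, df2 = 95, lower.tail = FALSE)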

There is much, much more that can be said about ANOVA. I'm not even scratching the surface. But what I wanted to show here is a visual of the distribution of the F statistic. Hopefully it'll give you a better idea of what exactly an F distribution is when you're learning ANOVA for the first time.