Category Archives: Hypothesis Testing

Hypothesis testing: section 4.6 of Regression and Other Stories

Gelman, et al. tell a story from long ago where someone sent them a fax (that’s right, a fax) asking for help with suspected voter fraud. The story is in section 4.6 (page 63) and is included to provide an example of constructing a hypothesis test. They provide data and code for this example in the Coop folder on Github. The point of this post is to document some changes I made to the code to help me understand it.

The story involves the election of a board of directors for a “residential organization”. 5553 people were allowed to vote for up to 6 people. 27 candidates were running for the board. Votes were tallied after the first 600 people voted, then again at 1200, 2444, 3444, and 4444 voters, and finally at the end after all 5553 people had voted. What aroused suspicion was the fact that the proportion of votes for the candidates remained steady each time the votes were tallied. According to the author of the fax: “the election was rigged…[it] is a fixed vote with fixed percentages being assigned to each and every candidate making it impossible to participate in an honest election.”

Let’s read in the data and demonstrate what they’re talking about. Notice this data is the rare CSV without column headers. The data consists of 27 rows, one for each candidate, showing cumulative vote totals.

data <- read.csv("https://raw.githubusercontent.com/avehtari/ROS-Examples/master/Coop/data/Riverbay.csv", 
                 header = FALSE)
# drop the 1st and 8th columns; they contain candidate names, which we don't need
votes <- data[,2:7]
head(votes)
##    V2  V3  V4   V5   V6   V7
## 1 208 416 867 1259 1610 2020
## 2  55 106 215  313  401  505
## 3 133 250 505  716  902 1129
## 4 101 202 406  589  787  976
## 5 108 249 512  745  970 1192
## 6  54  94 196  279  360  451

Now let’s calculate the proportion of votes received at each interval and create a basic line plot. Each line below represents the proportion of votes received by a candidate at each of the six intervals. Notice how the lines are mostly flat. This is what prompted the emergency fax.

vote_p <- apply(votes, 2, proportions)
matplot(t(vote_p), type = "l", col = 1, lty = 1)

Gelman, et al. demonstrate this using separate plots for the top 8 vote-getters (Fig 4.5). They also divide by number of voters instead of total votes received. (Remember, each voter gets to vote for up to six people.) This simply changes the denominator, and hence, the y-axis. The steady vote patterns remain.

voters <- c(600,1200,2444,3444,4444,5553)
vote_p <- sweep(votes, 2, voters, FUN = "/")
matplot(t(vote_p), type = "l", col = 1, lty = 1)

They note that the data in this plot is not independent since proportions at times 2 and beyond include votes that came before. To address this, they create a matrix that contains number of votes received at each interval instead of cumulative totals.

interval_votes <- t(apply(votes, 1, diff))
interval_votes <- cbind(votes[,1], interval_votes)
head(interval_votes)
##           V3  V4  V5  V6  V7
## [1,] 208 208 451 392 351 410
## [2,]  55  51 109  98  88 104
## [3,] 133 117 255 211 186 227
## [4,] 101 101 204 183 198 189
## [5,] 108 141 263 233 225 222
## [6,]  54  40 102  83  81  91

After taking differences the lines still seem mostly stable.

interval_p <- apply(interval_votes, 2, proportions)
matplot(t(interval_p), type = "l", col = 1, lty = 1)

Again, the authors divide by number of voters instead of total votes to create these plots, but the result is the same with a different y-axis. Here’s how I would do the calculations and create the plot.

interval_voters <- c(600, diff(voters))
interval_p <- sweep(interval_votes, 2, interval_voters, FUN = "/")
matplot(t(interval_p), type = "l", col = 1, lty = 1)

And now comes the hypothesis test. What is the probability of seeing steady proportions like this if the votes really were coming in at random? I’ll quote the book here: “Because the concern was that the votes were unexpectedly stable as the count proceeded, we define a test statistic to summarize variability.” The test statistic in this case is the standard deviation of each candidate’s sample proportions across the six intervals. We can quickly get these from the interval_p object we created above.

test_stat <- apply(interval_p, 1, sd)

Now we need to calculate the theoretical test statistic. For this we assume each candidate has a fixed but unknown proportion of voters who will vote for them, \(\pi_i\). Under the null, the six intervals where votes are tallied are random samples of the voters. So at each time point we can think of the proportion as a draw from a distribution with mean \(\pi_i\) and standard deviation \(\sqrt{\pi_i(1 - \pi_i)/n_t}\), where \(n_t\) is the number of voters at each interval. To calculate this, we first need to estimate \(\pi_i\) with \(p_i\), the observed proportion of votes each candidate received. This is the last column of the votes data frame divided by the total number of voters, 5553.

p_hat <- votes[,6]/5553

Then we take the average of the variances calculated at each time point and take the square root to get the theoretical test statistic.

theory_test_stat <- sapply(p_hat, function(x)sqrt(mean(x*(1-x)/interval_voters)))

Under the null, the observed test statistics should be very close to the theoretical test statistics. This is assessed in Fig 4.7 in the book. I replicate the plot as follows:

plot(x = votes[,6], y = test_stat, xlab = "total # of votes for the candidate",
     ylab = "sd of separate vote proportions")
points(x = votes[,6], y = theory_test_stat, pch = 19)

The authors note that “the actual standard deviations appear consistent with the theoretical model.”

Personally I think the plot would be a little more effective if it were zoomed out a bit. Some of the dramatic-looking departures are only off by about 0.01. For example:

plot(x = votes[,6], y = test_stat, xlab = "total # of votes for the candidate",
     ylab = "sd of separate vote proportions", ylim = c(0,0.05))
points(x = votes[,6], y = theory_test_stat, pch = 19)

Another null hypothesis approach is the chi-square test of association. Under the null, the number of votes is not associated with the interval when votes were tallied. We can run this test for each candidate and look at the p-values. If there is no association for each candidate we should see a fairly uniform scatter of p-values. On the other hand, if there were “suspiciously little variation over time” we would see a surplus of high p-values. Here’s how I carried out these calculations. I first created the 2-way tables of yes/no versus time for each candidate, then applied the chi-square test to each table, and finally extracted the p-value from each result. A uniform QQ plot shows the p-values are mostly uniformly distributed.

tables <- apply(interval_votes, 1, function(x) rbind(x, interval_voters - x), 
      simplify = FALSE)
chisq_out <- lapply(tables, chisq.test, correct = FALSE)
p_values <- sapply(chisq_out, function(x)x$p.value)
qqplot(ppoints(27), p_values)
qqline(p_values, distribution = qunif)

Finally the authors mention that a single test on the entire 27 x 6 table could be performed. This seems like the easiest approach of all.

chisq.test(interval_votes, correct = F)
## 
##  Pearson's Chi-squared test
## 
## data:  interval_votes
## X-squared = 114.72, df = 130, p-value = 0.8279

My R code differs quite a bit from the R code provided by the authors. I’m not saying mine is better, it just makes more sense to me. Maybe someone else will find this approach useful.

Parametric Bootstrap of Kolmogorov–Smirnov Test

Zeimbekakis, et al. recently published an article in The American Statistician titled On Misuses of the Kolmogorov–Smirnov Test for One-Sample Goodness-of-Fit. One of the misuses they discuss is using the KS test with parameters estimated from the sample. For example, let’s sample some data from a normal distribution.

x <- rnorm(200, mean = 8, sd = 8)
c(xbar = mean(x), s = sd(x))
##     xbar        s 
## 8.333385 7.979586

If we wanted to assess the goodness-of-fit of this sample to a normal distribution, the following is a bad way to use the KS test:

ks.test(x, "pnorm", mean(x), sd(x))
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  x
## D = 0.040561, p-value = 0.8972
## alternative hypothesis: two-sided

The appropriate way to use the KS test is to actually supply hypothesized parameters. For example:

ks.test(x, "pnorm", 8, 8)
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  x
## D = 0.034639, p-value = 0.9701
## alternative hypothesis: two-sided

The results of both tests are similar: we fail to reject the null hypothesis that the sample came from the hypothesized normal distribution. However, the former test is very conservative. Zeimbekakis, et al. show this via simulation, and I show a simplified version of their simulation below. The basic idea is that if the test were valid, the p-values would be uniformly distributed and the points in the uniform distribution QQ-plot would fall along a diagonal line. Clearly that’s not the case.

n <- 200
rout <- replicate(n = 1000, expr = {
  x <- rnorm(n, 8 , 8)
  xbar <- mean(x)
  s <- sd(x)
  ks.test(x, "pnorm", xbar, s)$p.value
})
hist(rout, main = "Histogram of p-values")

qqplot(x = ppoints(n), y = rout, main = "Uniform QQ-plot")
qqline(rout, distribution = qunif)

Conclusion: using fitted parameters in place of the true parameters in the KS test yields conservative results. The authors state in the abstract that this “has been ‘discovered’ multiple times.”
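To put a rough number on that conservativeness, we can check the empirical rejection rate at the nominal 5% level using the rout vector of p-values from the simulation above (a quick check of my own, not from the paper). A well-calibrated test would reject about 5% of the time; this one rejects far less often.

# empirical type I error rate at the nominal 5% level;
# should be near 0.05 for a valid test, but is much smaller here
mean(rout < 0.05)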

When done the right way, the KS test yields uniformly distributed p-values.

rout2 <- replicate(n = 1000, expr = {
  x <- rnorm(n, 8 , 8)
  ks.test(x, "pnorm", 8, 8)$p.value
})
hist(rout2)

qqplot(x = ppoints(n), y = rout2, main = "Uniform QQ-plot")
qqline(rout2, distribution = qunif)

Obviously it’s difficult to know which parameters to supply to the KS test. Above we knew to supply 8 as the mean and standard deviation because that’s what we used to generate the data. But what to do in real life? Zeimbekakis, et al. propose a parametric bootstrap to approximate the null distribution of the KS test statistic. The steps to implement the bootstrap are as follows:

  1. draw a random sample from the fitted distribution
  2. get estimates of parameters of random sample
  3. obtain the empirical distribution function
  4. calculate the bootstrapped KS statistic
  5. repeat steps 1 – 4 many times

Let’s do it. The following code is a simplified version of what the authors provide with the paper. Notice they use MASS::fitdistr() to obtain MLE parameter estimates. This returns the same mean for the normal distribution but a slightly smaller (i.e. biased) estimated standard deviation.

param  <- MASS::fitdistr(x, "normal")$estimate
ks <- ks.test(x, function(x)pnorm(x, param[1], param[2]))
stat <- ks$statistic
B <- 1000
stat.b <- double(B)
n <- length(x)

## bootstrapping
for (i in 1:B) {
  # (1) draw a random sample from a fitted dist
  x.b <- rnorm(n, param[1], param[2])
  # (2) get estimates of parameters of random sample
  fitted.b <- MASS::fitdistr(x.b, "normal")$estimate
  # (3) get empirical distribution function
  Fn <- function(x)pnorm(x, fitted.b[1], fitted.b[2])
  # (4) calculate bootstrap KS statistic
  stat.b[i] <- ks.test(x.b, Fn)$statistic
}
mean(stat.b >= stat)
## [1] 0.61

The p-value is the proportion of statistics greater than or equal to the observed statistic calculated with estimated parameters.

Let’s turn this into a function and show that it returns uniformly distributed p-values when used with multiple samples. Again this is a simplified version of the R code the authors generously shared with their paper.

ks.boot <- function(x, B = 1000){
  param  <- MASS::fitdistr(x, "normal")$estimate
  ks <- ks.test(x, function(k)pnorm(k, param[1], param[2]))
  stat <- ks$statistic
  stat.b <- double(B)
  n <- length(x)
  for (i in 1:B) {
    x.b <- rnorm(n, param[1], param[2])
    fitted.b <- MASS::fitdistr(x.b, "normal")$estimate
    Fn <- function(x)pnorm(x, fitted.b[1], fitted.b[2])
    stat.b[i] <- ks.test(x.b, Fn)$statistic
  }
  mean(stat.b >= stat)
}

Now replicate the function with many samples. This takes a moment to run. It took my Windows 11 PC with an Intel i7 chip about 100 seconds to run.

rout_boot <- replicate(n = 1000, expr = {
  x <- rnorm(n, 8 , 8)
  ks.boot(x)
})
hist(rout_boot)

qqplot(x = ppoints(n), y = rout_boot, main = "Uniform QQ-plot")
qqline(rout_boot, distribution = qunif)

Most useful tests for an ANCOVA model

In his book Regression Modeling Strategies, 2nd Ed, Frank Harrell provides a list of what he calls the “most useful tests” for a 2-level factor \(\times\) numeric model (Table 2.2, p. 19). This is often called an Analysis of Covariance, or ANCOVA. The basic idea is we have a numeric response with substantial variability and we seek to understand the variability by modeling the mean of the response as a function of a categorical variable and a numeric variable.

Let’s simulate some data for such a model and then see how we can use R to carry out these tests.

n <- 400
set.seed(1)
sex <- factor(sample(x = c("f", "m"), size = n, replace = TRUE))
age <- round(runif(n = n, min = 18, max = 65))
y <- 1 + 0.8*age + 0.4*(sex == "m") - 0.7*age*(sex == "m") + rnorm(n, mean = 0, sd = 8)
dat <- data.frame(y, age, sex)

The data contain a numeric response, y, that is a function of age and sex. I set the “true” coefficient values to 1, 0.8, 0.4, and -0.7. They correspond to \(\beta_0\) through \(\beta_3\) in the following model:

\[y = \beta_0 + \beta_1 age + \beta_2 sex + \beta_3 age \times sex\]

In addition the error component is a Normal distribution with a standard deviation of 8.

Now let’s model the data and see how close we get to recovering the true parameter values.

mod <- lm(y ~ age * sex, dat)
summary(mod)
## 
## Call:
## lm(formula = y ~ age * sex, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.8986  -5.8552  -0.2503   6.0507  30.6188 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.27268    1.93776   0.141    0.888    
## age          0.79781    0.04316  18.484   <2e-16 ***
## sexm         2.07143    2.84931   0.727    0.468    
## age:sexm    -0.72702    0.06462 -11.251   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.661 on 396 degrees of freedom
## Multiple R-squared:  0.7874, Adjusted R-squared:  0.7858 
## F-statistic:   489 on 3 and 396 DF,  p-value: < 2.2e-16

While the coefficient estimates for age and the age \(\times\) sex interaction are pretty close to the true values, the same cannot be said for the intercept and sex coefficients. The residual standard error of 8.661 is close to the true value of 8.
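That said, the standard errors for the intercept and sex terms are large, so this isn’t too surprising. As a quick check of my own (not part of Harrell’s discussion), the confidence intervals for those two coefficients are wide and easily cover the true values of 1 and 0.4:

# wide intervals for the intercept and sex coefficients cover the true values
confint(mod, parm = c("(Intercept)", "sexm"))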

We can see in the summary output of the model that four hypothesis tests, one for each coefficient, are carried out for us. Each tests whether the coefficient is equal to 0. Of those four, only one qualifies as one of the most useful tests: the last one, for age:sexm. This tests whether the effect of age is independent of sex and vice versa. Stated two other ways, it tests whether age and sex are additive, or whether the age effect is the same for both sexes. To get a better understanding of what we’re testing, let’s plot the data with fitted age slopes for each sex.

library(ggplot2)
ggplot(dat, aes(x = age, y = y, color = sex)) +
  geom_point() +
  geom_smooth(method="lm")


Visually it appears the effect of age is not independent of sex. It seems more pronounced for females. Is this effect real, or might it be due to chance? The hypothesis test in the summary output for age:sexm evaluates this. The effect certainly appears real: we would be unlikely to see a difference in slopes this large if there truly were no difference. It does appear the effect of age is different for each sex. The estimate of -0.727 is the estimated difference in slopes (or age effects) between males and females.

The other three hypothesis tests are not very useful.

  • Testing if the Intercept is 0 is testing whether y is 0 for females at age 0.
  • Testing if age is 0 is testing whether age is associated with y for males.
  • Testing if sexm is 0 is testing whether sex is associated with y for subjects at age 0.

Other more useful tests, as Harrell outlines in Table 2.2, are as follows:

  • Is age associated with y?
  • Is sex associated with y?
  • Are either age or sex associated with y?

The last one is answered in the model output. That’s the F-statistic in the last line. It tests whether all coefficients (except the intercept) are equal to 0. The result of this test is conclusive. At least one of the coefficients is not 0.
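If you want to see that overall test carried out explicitly, one option (a sketch of my own, not from Harrell’s table) is to compare the fitted model to an intercept-only model with anova(); the F statistic and p-value should match the last line of the summary output above.

# overall F test: compare the full model to an intercept-only model
mod0 <- lm(y ~ 1, data = dat)
anova(mod0, mod)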

To test if age is associated with y, we need to test if both the age and age:sexm coefficients are equal to 0. The car package by John Fox provides a nice function for this purpose called linearHypothesis. It takes at least two arguments. The first is the fitted model object and the second is a vector of hypothesis tests. Below we specify that we want to test if “age = 0” and “age:sexm = 0”.

library(car)
linearHypothesis(mod, c("age = 0", "age:sexm = 0"))
## Linear hypothesis test
## 
## Hypothesis:
## age = 0
## age:sexm = 0
## 
## Model 1: restricted model
## Model 2: y ~ age * sex
## 
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1    398 55494                                  
## 2    396 29704  2     25790 171.91 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The result is once again conclusive. The p-value is virtually 0. It does indeed appear that age is associated with y.

Likewise, to test if sex is associated with y, we need to test if both the sex and age:sexm coefficients are equal to 0.

linearHypothesis(mod, c("sexm = 0", "age:sexm = 0"))
## Linear hypothesis test
## 
## Hypothesis:
## sexm = 0
## age:sexm = 0
## 
## Model 1: restricted model
## Model 2: y ~ age * sex
## 
##   Res.Df    RSS Df Sum of Sq     F    Pr(>F)    
## 1    398 119354                                 
## 2    396  29704  2     89651 597.6 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As expected this test confirms that sex is associated with y, just as we specified when we simulated the data.

Now that we have established that age is associated with y, and that the association differs for each sex, what exactly is that association for each sex? In other words what are the slopes of the lines in our plot above?

We can sort of answer that with the model coefficients.

round(coef(mod),3)
## (Intercept)         age        sexm    age:sexm 
##       0.273       0.798       2.071      -0.727

That corresponds to the following model:

\[y = 0.273 + 0.798 age + 2.071 sex - 0.727 age \times sex\]

When sex is female, the fitted model is

\[y = 0.273 + 0.798 age \]

This says the slope of age is about 0.8 when sex is female.

When sex is male, the fitted model is

\[y = (0.273 + 2.071) + (0.798 - 0.727) age \]
\[y = 2.344 + 0.071 age \]

This says the slope of age is about 0.07 when sex is male.

How certain are we about these estimates? That’s what standard error is for. For the age slope estimate for females the standard error is provided in the model output for the age coefficient. It shows about 0.04. Adding and subtracting 2 \(\times\) 0.04 to the coefficient gives us a rough 95% confidence interval. Or we could just use the confint function:

confint(mod, parm = "age")
##         2.5 %    97.5 %
## age 0.7129564 0.8826672

The standard error of the age slope estimate for males takes a little more work. Another car function useful for this is the deltaMethod function. It takes at least three arguments: the model object, the quantity expressed as a character phrase that we wish to estimate a standard error for, and the names of the parameters. The function then calculates the standard error using the delta method. Here’s one way to do it for our model.

deltaMethod(mod, "b1 + b3", parameterNames = paste0("b", 0:3)) 
##           Estimate         SE       2.5 %    97.5 %
## b1 + b3 0.07079277 0.04808754 -0.02345709 0.1650426

The standard error is similar in magnitude, but since our estimate is so small the resulting confidence interval overlaps 0. This tells us the effect of age on males is too small for our data to determine if the effect is positive or negative.
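For what it’s worth, the same standard error can be computed directly from the variance-covariance matrix of the coefficients, since \(Var(b_1 + b_3) = Var(b_1) + Var(b_3) + 2Cov(b_1, b_3)\). Here’s a quick check of my own:

# variance of a sum of coefficients: Var(b1) + Var(b3) + 2*Cov(b1, b3)
V <- vcov(mod)
sqrt(V["age", "age"] + V["age:sexm", "age:sexm"] + 2 * V["age", "age:sexm"])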

Another way to get the estimated age slopes for each sex, along with standard errors and confidence intervals, is to use the margins package. We use the margins function with our model object and specify that we want to estimate the marginal effect of age at each level of sex. (“marginal effect of age” is another way of saying the effect of age at each level of sex)

library(margins)
margins(mod, variables = "age", at = list(sex = c("f", "m")))
## Average marginal effects at specified values
## lm(formula = y ~ age * sex, data = dat)
##  at(sex)     age
##        f 0.79781
##        m 0.07079

This does the formula work we did above. It plugs in each level of sex and returns the estimated slope coefficient for age. If we wrap the call in summary we get the standard errors and confidence intervals.

summary(margins(mod, variables = "age", at = list(sex = c("f", "m"))))
##  factor    sex    AME     SE       z      p   lower  upper
##     age 1.0000 0.7978 0.0432 18.4841 0.0000  0.7132 0.8824
##     age 2.0000 0.0708 0.0481  1.4722 0.1410 -0.0235 0.1650

Profile likelihood ratio confidence intervals

When you fit a generalized linear model (GLM) in R and call confint on the model object, you get confidence intervals for the model coefficients. But you also get an interesting message:

Waiting for profiling to be done...

What's that all about? What exactly is being profiled? Put simply, it's telling you that it's calculating a profile likelihood ratio confidence interval.

The typical way to calculate a 95% confidence interval is to multiply the standard error of an estimate by some normal quantile such as 1.96 and add/subtract that product to/from the estimate to get an interval. In the context of GLMs, we sometimes call that a Wald confidence interval.

Another way to determine an upper and lower bound of plausible values for a model coefficient is to find the minimum and maximum value of the set of all coefficients that satisfy the following:

\[-2\log\left(\frac{L(\beta_{0}, \beta_{1}|y_{1},\dots,y_{n})}{L(\hat{\beta_{0}}, \hat{\beta_{1}}|y_{1},\dots,y_{n})}\right) < \chi_{1,1-\alpha}^{2}\]

Inside the parentheses is a ratio of likelihoods. In the denominator is the likelihood of the model we fit. In the numerator is the likelihood of the same model but with different coefficients. (More on that in a moment.) We take the log of the ratio and multiply by -2. This gives us a likelihood ratio test (LRT) statistic. This statistic is typically used to test whether a coefficient is equal to some value, such as 0, with the null likelihood in the numerator (model without the coefficient, that is, with it set to 0) and the alternative or estimated likelihood in the denominator (model with the coefficient). If the LRT statistic is less than \(\chi_{1,0.95}^{2} \approx 3.84\), we fail to reject the null. The coefficient is statistically not much different from 0. That means the likelihood ratio is close to 1: the likelihood of the model without the coefficient is almost as high as that of the model with it. On the other hand, if the ratio is small, the likelihood of the model without the coefficient is much smaller than the likelihood of the model with the coefficient. After the log transform and multiplication by -2, this leads to an LRT statistic larger than 3.84 and thus rejection of the null.

Now in the formula above, we are seeking all such coefficients in the numerator that would make it a true statement. You might say we're “profiling” many different null values and their respective LRT statistics. Do they fit the profile of a plausible coefficient value in our model? The smallest value we can get without violating the condition becomes our lower bound, and likewise with the largest value. When we're done we'll have a range of plausible values for our model coefficient that gives us some indication of the uncertainty of our estimate.

Let's load some data and fit a binomial GLM to illustrate these concepts. The following R code comes from the help page for confint.glm. This is an example from the classic Modern Applied Statistics with S. ldose is a dosing level and sex is self-explanatory. SF is number of successes and failures, where success is number of dead worms. We're interested in learning about the effects of dosing level and sex on number of worms killed. Presumably this worm is a pest of some sort.

# example from Venables and Ripley (2002, pp. 190-2.)
ldose <- rep(0:5, 2)
numdead <- c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16)
sex <- factor(rep(c("M", "F"), c(6, 6)))
SF <- cbind(numdead, numalive = 20-numdead)
budworm.lg <- glm(SF ~ sex + ldose, family = binomial)
summary(budworm.lg)
## 
## Call:
## glm(formula = SF ~ sex + ldose, family = binomial)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.10540  -0.65343  -0.02225   0.48471   1.42944  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.4732     0.4685  -7.413 1.23e-13 ***
## sexM          1.1007     0.3558   3.093  0.00198 ** 
## ldose         1.0642     0.1311   8.119 4.70e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 124.8756  on 11  degrees of freedom
## Residual deviance:   6.7571  on  9  degrees of freedom
## AIC: 42.867
## 
## Number of Fisher Scoring iterations: 4

The coefficient for ldose looks significant. Let's determine a confidence interval for the coefficient using the confint function. We call confint on our model object, budworm.lg and use the parm argument to specify that we only want to do it for ldose:

confint(budworm.lg, parm = "ldose")
## Waiting for profiling to be done...
##     2.5 %    97.5 % 
## 0.8228708 1.3390581

We get our “waiting” message though there really was no wait. If we fit a larger model and request multiple confidence intervals, then there might actually be a waiting period of a few seconds. The lower bound is about 0.82 and the upper bound about 1.34. We might say every increase in dosing level increases the log odds of killing worms by at least 0.8. We could also exponentiate to get a CI for an odds ratio estimate:

exp(confint(budworm.lg, parm = "ldose"))
## Waiting for profiling to be done...
##    2.5 %   97.5 % 
## 2.277027 3.815448

The odds of “success” (killing worms) is at least 2.3 times higher at one dosing level versus the next lower dosing level.
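As an aside, the Wald interval described at the top of this post (estimate \(\pm\) 1.96 \(\times\) standard error) can be obtained with confint.default() for comparison. For this model it is close to, but not identical to, the profile likelihood interval:

# Wald interval for ldose: estimate +/- 1.96 * SE
confint.default(budworm.lg, parm = "ldose")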

To better understand the profile likelihood ratio confidence interval, let's do it “manually”. Recall the denominator in the formula above was the likelihood of our fitted model. We can extract that with the logLik function:

den <- logLik(budworm.lg)
den
## 'log Lik.' -18.43373 (df=3)

The numerator was the likelihood of a model with a different coefficient. Here's the likelihood of a model with a coefficient of 1.05:

num <- logLik(glm(SF ~ sex + offset(1.05*ldose), family = binomial))
num
## 'log Lik.' -18.43965 (df=2)

Notice we used the offset function. That allows us to fix the coefficient to 1.05 and not have it estimated.

Since we already extracted the log likelihoods, we need to subtract them. Remember this rule from algebra?

\[\log\frac{M}{N} = \log M - \log N\]

So we subtract the denominator from the numerator, multiply by -2, and check if the result is less than 3.84, which we calculate with qchisq(p = 0.95, df = 1):

-2*(num - den)
## 'log Lik.' 0.01184421 (df=2)
-2*(num - den) < qchisq(p = 0.95, df = 1)
## [1] TRUE

It is. 1.05 seems like a plausible value for the ldose coefficient. That makes sense since the estimated value was 1.0642. Let's try it with a larger value, like 1.5:

num <- logLik(glm(SF ~ sex + offset(1.5*ldose), family = binomial))
-2*(num - den) < qchisq(p = 0.95, df = 1)
## [1] FALSE

FALSE. 1.5 seems too big to be a plausible value for the ldose coefficient.

Now that we have the general idea, we can program a while loop to check different values until we exceed our threshold of 3.84.

cf <- budworm.lg$coefficients[3]  # fitted coefficient 1.0642
cut <- qchisq(p = 0.95, df = 1)   # about 3.84
e <- 0.001                        # increment to add to coefficient
LR <- 0                           # to kick start our while loop 
while(LR < cut){
  cf <- cf + e
  num <- logLik(glm(SF ~ sex + offset(cf*ldose), family = binomial))
  LR <- -2*(num - den)
}
(upper <- cf)
##    ldose 
## 1.339214

To begin we save the original coefficient to cf, store the cutoff value to cut, define our increment of 0.001 as e, and set LR to an initial value of 0. In the loop we increment our coefficient estimate which is used in the offset function in the estimation step. There we extract the log likelihood and then calculate LR. If LR is less than cut (3.84), the loop starts again with a new coefficient that is 0.001 higher. We see that our upper bound of 1.339214 is very close to what we got above using confint (1.3390581). If we set e to smaller values we'll get closer.

We can find the LR profile lower bound in a similar way. Instead of adding the increment we subtract it:

cf <- budworm.lg$coefficients[3]  # reset cf
LR <- 0                           # reset LR 
while(LR < cut){
  cf <- cf - e
  num <- logLik(glm(SF ~ sex + offset(cf*ldose), family = binomial))
  LR <- -2*(num - den)
}
(lower <- cf)
##    ldose 
## 0.822214

The result, 0.822214, is very close to the lower bound we got from confint (0.8228708).
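If the brute-force increments bother you, an alternative (a sketch of my own, not from the book or the confint source) is to treat this as a root-finding problem: define a function that returns the LRT statistic minus the cutoff and let uniroot() find where it crosses 0 on either side of the estimate.

# LRT statistic at a candidate coefficient value minus the chi-square cutoff;
# the profile interval bounds are where this function equals 0
lr_diff <- function(cf){
  num <- logLik(glm(SF ~ sex + offset(cf*ldose), family = binomial))
  as.numeric(-2*(num - den)) - qchisq(p = 0.95, df = 1)
}
est <- unname(budworm.lg$coefficients[3])
uniroot(lr_diff, interval = c(est, est + 1))$root  # upper bound
uniroot(lr_diff, interval = c(est - 1, est))$root  # lower bound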

This is a very basic implementation of calculating a likelihood ratio confidence interval. It is only meant to give a general sense of what's happening when you see that message Waiting for profiling to be done.... I hope you found it helpful. To see how R does it, enter getAnywhere(profile.glm) in the console and inspect the code. It's not for the faint of heart.

I have to mention the book Analysis of Categorical Data with R, from which I gained a better understanding of the material in this post. The authors have kindly shared their R code at the following web site if you want to have a look: http://www.chrisbilder.com/categorical/

To see how they “manually” calculate likelihood ratio confidence intervals, go to the following R script and see the section “Examples of how to find profile likelihood ratio intervals without confint()”: http://www.chrisbilder.com/categorical/Chapter2/Placekick.R

Testing Composite Hypotheses about Fixed Effects (Ch 4 of ALDA)

I’m still on my ALDA kick here, this time posting about section 4.7 of Chapter 4. In my last post I talked about deviance-based hypothesis tests in the context of model building. Recall you have to be aware of which method of estimation you used when you deploy the deviance statistic. Namely, if you used Restricted Maximum Likelihood (RML) to estimate model parameters, you can only use the deviance statistic to test hypotheses about variance components. This is an important point as many programs default to RML, such as the lme4 package in R and SAS PROC MIXED. But the deviance statistic is not the only tool for testing composite hypotheses.

That leads us to section 4.7 and the Wald statistic. The Wald statistic allows you to “test composite hypotheses about multiple effects regardless of the method of estimation used. This means that if you use restricted methods of estimation, which prevent you from using deviance-based tests to compare models with different fixed effects, you still have a means of testing composite hypotheses about sets of fixed effects.” (p. 122) Sounds good to me!

The authors give two examples, one of which I want to review in this post. As usual they don’t show you how to do the test using statistical software. Unfortunately neither does the UCLA stats consulting page for ALDA. So I had to figure it out on my own.

Let’s recap the example study motivating the work in this chapter. 82 adolescents were surveyed on alcohol use. Some of the variables collected included:

  • alcuse, a rating-scale measure of alcohol use (the response)
  • age_14, age of participant centered about 14 (14 = 0, 15 = 1, 16 = 2)
  • coa, an indicator whether or not participant is a child of an alcoholic (1 = yes, 0 = no)
  • cpeer, a rating-scale measure of alcohol use among peers centered on its sample mean of 1.018
  • id, an arbitrary identifier used to group observations by person

These variables are part of Model F, the model of interest in section 4.7, which aims to explain the variability in alcuse. (Models A through E precede Model F in the earlier model-building portion of the chapter.) Here’s Model F in its multilevel form:

level 1
\[Y_{ij} = \pi_{0i} + \pi_{1i} age14_{ij} + \epsilon_{ij}\]

level 2
\[\pi_{0i} = \gamma_{00} + \gamma_{01} coa + \gamma_{02} cpeer + \zeta_{0i}\]
\[\pi_{1i} = \gamma_{10} + \gamma_{12} cpeer + \zeta_{1i}\]

So this model posits that individuals have a linear trajectory over time (level 1), and that the parameters of that trajectory differ between individuals based on coa and cpeer (level 2).

We can combine the two levels into one scary-looking composite representation of the model:
\[Y_{ij} = \gamma_{00} + \gamma_{01} coa + \gamma_{02} cpeer + \gamma_{10} age14 + \gamma_{12} cpeer \times age14 + \zeta_{0i} + \zeta_{1i} age14 + \epsilon_{ij}\]

Then we can estimate the parameters of that model in R with the following code:

alcohol1 <- read.table("http://www.ats.ucla.edu/stat/r/examples/alda/data/alcohol1_pp.txt", 
                       header=T, sep=",")
attach(alcohol1)
library(lme4)
model.f1 <- lmer(alcuse ~ coa + cpeer*age_14 + (age_14 | id), alcohol1, REML = FALSE)
summary(model.f1)

And now we are ready to test composite hypotheses about this model. The first example in the book asks whether the average child of non-alcoholic parents, with an average value of peer alcohol use (cpeer = 0), drinks no alcohol at age 14 (intercept = 0) and remains abstinent over time (slope = 0). So we set coa and cpeer both equal to 0 in our composite model and we're left with this:

\[Y_{ij} = \gamma_{00} + \gamma_{10} age14 + \zeta_{0i} + \zeta_{1i} age14 + \epsilon_{ij}\]

Thus our question is essentially asking if the slope and intercept in this model are 0. Or to state it formally, our composite null hypothesis is as follows:

\[H_{0}: \gamma_{00} = 0 \:\textrm{and}\: \gamma_{10} = 0\]

Now to carry out this test, we need to express this hypothesis as a general linear hypothesis in matrix notation. First let's restate the hypothesis using our fixed effects:

\[1\gamma_{00} + 0\gamma_{01} + 0\gamma_{02} + 0\gamma_{10} + 0\gamma_{12} = 0\]
\[0\gamma_{00} + 0\gamma_{01} + 0\gamma_{02} + 1\gamma_{10} + 0\gamma_{12} = 0\]

We have weighted the coefficients so that only the two we're interested in remain. Now we create a matrix of the weights. This is easier and faster to show in R than trying to use LaTeX, which I'm not even sure I can pull off with the WordPress plugin I'm using.

C <- matrix(c(1,0,0,0,0,0,0,0,1,0), nrow=2, byrow=TRUE)
C
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    0    0    0    0
[2,]    0    0    0    1    0

Now using the matrix we just created we can conduct a linear hypothesis test using the linearHypothesis function available in the car package, like so:

library(car)
linearHypothesis(model.f1, C)

This returns a Wald statistic of 51.03 on 2 degrees of freedom, almost matching the book which reports 51.01. The p-value is practically 0, which means we reject this composite hypothesis.

Now it's nice to know there's an R function that will calculate the Wald statistic, but what is it? How can we find out? The following code reveals the source of linearHypothesis:

print(getAnywhere(linearHypothesis.lme))

In it we see the following calculation:

SSH <- as.vector(t(L %*% b - rhs) %*% solve(L %*% V %*% t(L)) %*% 
        (L %*% b - rhs))

That's it. So we have some matrix multiplication happening. In this calculation L is our hypothesis matrix, b is our fixed effects, rhs means "right hand side" (which in our example is 0) and V is the variance-covariance matrix of the model parameters.

If we wanted to calculate the Wald statistic by hand, we could do the following:

# extract the model coefficients
b <- matrix(fixef(model.f1), nrow=5)

# create the "right-hand side"
q <- matrix(c(0,0),nrow=2)

# extract the variance-covariance matrix
V <- vcov(model.f1)

# calculate the Wald statistic
W <- t(C%*%b - q) %*% solve(C%*% V %*%t(C)) %*% (C%*%b - q)

To calculate the p-value, use the pchisq function:

pchisq(as.numeric(W),2,lower.tail=FALSE)

Explaining and simulating an F distribution

I remember feeling very uncomfortable with the F distribution when I started learning statistics. It comes up when you’re learning ANOVA, which is rather complicated if you’re learning it for the first time. You learn about partitioning the sum of squares, determining degrees of freedom, calculating mean square error and mean square between-groups, putting everything in a table and finally calculating something called an F statistic. Oh, and don’t forget the assumptions of constant variance and normal distributions (although ANOVA is pretty robust to those violations). There’s a lot to take in! And that’s before you calculate a p-value. And how do you do that? You use an F distribution, which when I learned it meant I had to flip to a massive table in the back of my stats book. Where did that table come from and why do you use it for ANOVA? I didn’t know, I didn’t care. It was near the end of the semester and I was on overload.

I suspect this happens to a lot of people. You work hard to understand the normal distribution, hypothesis testing, confidence intervals, etc. throughout the semester. Then you get to ANOVA sometime in late November (or April) and it just deflates you. You will yourself to understand the rules of doing it, but any sort of intuition for it escapes you. You just want to finish the class.

This post attempts to provide an intuitive explanation of what an F distribution is with respect to one-factor ANOVA. Maybe someone will read this and benefit from it.

ANOVA stands for ANalysis Of VAriance. We’re analyzing variances to determine if there is a difference in means between more than 2 groups. In one-factor ANOVA we’re using one factor to determine group membership. Maybe we’re studying a new drug and create three groups: group A on 10 mg, group B on 5 mg, and group C on placebo. At the end of the study we measure some critical lab value. We take the mean of that lab value for the three groups. We wish to know if there is a difference in the means between those three population groups (assuming each group contains a random sample from three very large populations).

The basic idea is to estimate two variances and form a ratio of those variances. If the variances are about the same, the ratio will be close to 1 and we have no evidence that the populations means differ based on our sample.

One variance is simply the mean of the group variances. Say you have three groups. Take the variance of each group and then find the average of those variances. That’s called the mean square error (MSE). In symbol-free mathematical speak, for three groups we have:

MSE = (variance(group A) + variance(group B) + variance(group C)) / 3

The other variance is the variance of the sample means multiplied by the number of items in each group (assuming equal sample sizes in each group). That’s called mean square between groups (MSB). It looks something like this:

MSB = variance(group means)*(n)

The F statistic is the ratio of these two variances: F = MSB/MSE
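Before simulating, here's a small check of my own (not from the original post) that these hand formulas agree with R's built-in aov() for three made-up groups of equal size:

# three hypothetical groups of 10 observations each, same mean and sd
set.seed(42)
g1 <- rnorm(10, 10, 4); g2 <- rnorm(10, 10, 4); g3 <- rnorm(10, 10, 4)
mse <- (var(g1) + var(g2) + var(g3))/3            # mean square error
msb <- var(c(mean(g1), mean(g2), mean(g3))) * 10  # mean square between
msb/mse                                           # F statistic by hand
# the F value in the aov() table should match
d <- data.frame(y = c(g1, g2, g3), grp = rep(c("g1", "g2", "g3"), each = 10))
summary(aov(y ~ grp, data = d))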

Now if the groups have the same means and same variance and are normally distributed, the F statistic has an F distribution. In other words, if we were to run our experiment hundreds and hundreds of times on three groups with the same mean and variance from a normal distribution, calculate the F statistic each time, and then make a histogram of all our F statistics, our histogram would have a shape that can be modeled with an F distribution.

That’s something we can do in R with the following code:

# generate 4000 F statistics for 5 groups with equal means and variances
R <- 4000
Fstat <- vector(length = R)
for (i in 1:R){
  g1 <- rnorm(20,10,4)
  g2 <- rnorm(20,10,4)
  g3 <- rnorm(20,10,4)
  g4 <- rnorm(20,10,4)
  g5 <- rnorm(20,10,4)
  mse <- (var(g1)+var(g2)+var(g3)+var(g4)+var(g5))/5
  M <- (mean(g1)+mean(g2)+mean(g3)+mean(g4)+mean(g5))/5
  msb <- ((((mean(g1)-M)^2)+((mean(g2)-M)^2)+((mean(g3)-M)^2)+((mean(g4)-M)^2)+((mean(g5)-M)^2))/4)*20
  Fstat[i] <- msb/mse
}
# plot a histogram of F statistics and superimpose F distribution (4, 95)
ylim <- (range(0, 0.8))
x <- seq(0,6,0.01)
hist(Fstat,freq=FALSE, ylim=ylim)
curve(df(x,4,95),add=T) # 5 - 1 = 4; 100 - 5 = 95

So I have 5 normally distributed groups each with a mean of 10 and standard deviation of 4. I take a random sample of 20 observations from each and calculate the MSE and MSB as I outlined above. I then calculate the F statistic. And I repeat 4000 times. Finally I plot a histogram of my F statistics and superimpose a theoretical F distribution on top of it. I get the resulting figure:

This is the distribution of the F statistic when the group samples come from Normal populations with identical means and variances. The smooth curve is an F distribution with 4 and 95 degrees of freedom. The 4 is Number of Groups - 1 (or 5 - 1). The 95 is from Total Number of Observations - Number of Groups (or 100 - 5). If we examine the figure we see that we most likely get an F statistic around 1. The bulk of the area under the curve is between 0.5 and 1.5. This is the distribution we would use to find a p-value for any experiment that involved 5 groups of 20 members each. When the population means of the groups are different, we get an F statistic greater than 1. The bigger the differences, the larger the F statistic. The larger the F statistic, the more confident we are that the population means between the 5 groups are not the same.

There is much, much more that can be said about ANOVA. I'm not even scratching the surface. But what I wanted to show here is a visual of the distribution of the F statistic. Hopefully it'll give you a better idea of what exactly an F distribution is when you're learning ANOVA for the first time.