Tag Archives: regression and other stories

Hypothesis testing: section 4.6 of Regression and Other Stories

Gelman, et al. tell a story from long ago where someone sent them a fax (that’s right, a fax) asking for help with suspected voter fraud. The story is in section 4.6 (page 63) and is included to provide an example of constructing a hypothesis test. They provide data and code for this example in the Coop folder on Github. The point of this post is to document some changes I made to the code to help me understand it.

The story involves the election of a board of directors for a “residential organization”. 5553 people were allowed to vote for up to 6 people. 27 candidates were running for the board. Votes were tallied after 600 people voted, then again at 1200, 2444, 3444, 4444, and the end after all 5553 people voted. What aroused suspicion was the fact that the proportion of votes for the candidates remained steady each time the votes were tallied. According to the author of the fax: “the election was rigged…[it] is a fixed vote with fixed percentages being assigned to each and every candidate making it impossible to participate in an honest election.”

Let’s read in the data and demonstrate what they’re talking about. Notice this data is the rare CSV without column headers. The data consists of 27 rows, one for each candidate, showing cumulative vote totals.

data <- read.csv("https://raw.githubusercontent.com/avehtari/ROS-Examples/master/Coop/data/Riverbay.csv", 
                 header = FALSE)
# drop 1st and 8th columns; contain candidate names which we don't need.
votes <- data[,2:7]
head(votes)
##    V2  V3  V4   V5   V6   V7
## 1 208 416 867 1259 1610 2020
## 2  55 106 215  313  401  505
## 3 133 250 505  716  902 1129
## 4 101 202 406  589  787  976
## 5 108 249 512  745  970 1192
## 6  54  94 196  279  360  451

Now let’s calculate the proportion of votes received at each interval and create a basic line plot. Each line below represents proportion of votes received for a candidate at each of the six intervals. Notice how the lines are mostly flat. This is what prompted the emergency fax.

vote_p <- apply(votes, 2, proportions)
matplot(t(vote_p), type = "l", col = 1, lty = 1)

Gelman, et al. demonstrate this using separate plots for the top 8 vote-getters (Fig 4.5). They also divide by number of voters instead of total votes received. (Remember, each voter gets to vote for up to six people.) This simply changes the denominator, and hence, the y-axis. The steady vote patterns remain.

voters <- c(600,1200,2444,3444,4444,5553)
vote_p <- sweep(votes, 2, voters, FUN = "/")
matplot(t(vote_p), type = "l", col = 1, lty = 1)

They note that the data in this plot is not independent since proportions at times 2 and beyond include votes that came before. To address this, they create a matrix that contains number of votes received at each interval instead of cumulative totals.

interval_votes <- t(apply(votes, 1, diff))
interval_votes <- cbind(votes[,1], interval_votes)
head(interval_votes)
##           V3  V4  V5  V6  V7
## [1,] 208 208 451 392 351 410
## [2,]  55  51 109  98  88 104
## [3,] 133 117 255 211 186 227
## [4,] 101 101 204 183 198 189
## [5,] 108 141 263 233 225 222
## [6,]  54  40 102  83  81  91

After taking differences the lines still seem mostly stable.

interval_p <- apply(interval_votes, 2, proportions)
matplot(t(interval_p), type = "l", col = 1, lty = 1)

Again, the authors divide by number of voters instead of total votes to create these plots, but the result is the same with a different y-axis. Here’s how I would do the calculations and create the plot.

interval_voters <- c(600, diff(voters))
interval_p <- sweep(interval_votes, 2, interval_voters, FUN = "/")
matplot(t(interval_p), type = "l", col = 1, lty = 1)

And now comes the hypothesis test. What is the probability of seeing steady proportions like this if the votes really were coming in at random? I’ll quote the book here: “Because the concern was that the votes were unexpectedly stable as the count proceeded, we define a test statistic to summarize variability.” The test statistic in this case is the standard deviations of the sample proportions. We can quickly get these from the interval_p object we created above.

test_stat <- apply(interval_p, 1, sd)

Now we need to calculate the theoretical test statistic. For this we assume each candidate has a fixed but unknown proportion of voters who will vote for them, \(\pi_i\). Under the null, the six intervals where votes are tallied are random samples of the voters. So at each time point we can think of the proportion as a draw from a distribution with mean \(\pi_i\) and standard deviation \(\sqrt{\pi_i(1 – \pi_i)/n_t}\), where \(n_t\) is the number of voters at each interval. To calculate this, we first need to estimate \(\pi_i\) with \(p_i\), the observed proportion of votes each candidate received. This is the last column of the votes data frame divided by the total number of voters, 5553.

p_hat <- votes[,6]/5553

Then we take the average of the variances calculated at each time point and take the square root to get the theoretical test statistic.

theory_test_stat <- sapply(p_hat, function(x)sqrt(mean(x*(1-x)/interval_voters)))

Under the null, the observed test statistics should be very close to the theoretical test statistics. This is assessed in Fig 4.7 in the book. I replicate the plot as follows:

plot(x = votes[,6], y = test_stat, xlab = "total # of votes for the candidate",
     ylab = "sd of separate vote proportions")
points(x = votes[,6], y = theory_test_stat, pch = 19)

The authors note that “the actual standard deviations appear consistent with the theoretical model.”

Personally I think the plot would be a little more effective if they zoomed out a little. Some of the dramatic looking departures are only off by 0.01. For example:

plot(x = votes[,6], y = test_stat, xlab = "total # of votes for the candidate",
     ylab = "sd of separate vote proportions", ylim = c(0,0.05))
points(x = votes[,6], y = theory_test_stat, pch = 19)

Another null hypothesis approach is the chi-square test of association. Under the null, the number of votes is not associated with the interval when votes were tallied. We can run this test for each candidate and look at the p-values. If there is no association for each candidate we should see a fairly uniform scatter of p-values. On the other hand, if there was “suspiciously little variation over time” we would see a surplus of high p-values. Here’s how I carried out these calculations. I first created the 2-way tables of yes/no versus time for each candidate. I then applied the chi-square test to each table, and to that result, I extracted each p-value. A uniform QQ plot shows the p-values are mostly uniformly distributed.

tables <- apply(interval_votes, 1, function(x) rbind(x, interval_voters - x), 
      simplify = FALSE)
chisq_out <- lapply(tables, chisq.test, correct = FALSE)
p_values <- sapply(chisq_out, function(x)x$p.value)
qqplot(ppoints(27), p_values)
qqline(p_values, distribution = qunif)

Finally the authors mention that a single test on the entire 27 x 6 table could be performed. This seems like the easiest approach of all.

chisq.test(interval_votes, correct = F)
## 
##  Pearson's Chi-squared test
## 
## data:  interval_votes
## X-squared = 114.72, df = 130, p-value = 0.8279

My R code differs quite a bit from the R code provided by the authors. I’m not saying mine is better, it just makes more sense to me. Maybe someone else will find this approach useful.

Parametric Bootstrap of Kolmogorov–Smirnov Test

Zeimbekakis, et al. recently published an article in The American Statistician titled On Misuses of the Kolmogorov–Smirnov Test for One-Sample Goodness-of-Fit. One of the misues they discuss is using the KS test with parameters estimated from the sample. For example, let’s sample some data from a normal distribution.

x <- rnorm(200, mean = 8, sd = 8)
c(xbar = mean(x), s = sd(x))
##     xbar        s 
## 8.333385 7.979586

If we wanted to assess the goodness-of-fit of this sample to a normal distribution, the following is a bad way to use the KS test:

ks.test(x, "pnorm", mean(x), sd(x))
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  x
## D = 0.040561, p-value = 0.8972
## alternative hypothesis: two-sided

The appropriate way to use the KS test is to actually supply hypothesized parameters. For example:

ks.test(x, "pnorm", 8, 8)
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  x
## D = 0.034639, p-value = 0.9701
## alternative hypothesis: two-sided

The results of both tests are the same. We fail to reject the null hypothesis that the sample is from a Normal distribution with the stated mean and standard deviation. However, the former test is very conservative. Zeimbekakis, et al. show this via simulation. I show a simplified version of this simulation. The basic idea is that if the test were valid, the p-values would be uniformly distributed and the points in the uniform distribution QQ-plot would fall along a diagonal line. Clearly that’s not the case.

n <- 200
rout <- replicate(n = 1000, expr = {
  x <- rnorm(n, 8 , 8)
  xbar <- mean(x)
  s <- sd(x)
  ks.test(x, "pnorm", xbar, s)$p.value
})
hist(rout, main = "Histogram of p-values")

qqplot(x = ppoints(n), y = rout, main = "Uniform QQ-plot")
qqline(rout, distribution = qunif)

Conclusion: using fitted parameters in place of the true parameters in the KS test yields conservative results. The authors state in the abstract that this “has been ‘discovered’ multiple times.”

When done the right way, the KS test yields uniformly distributed p-values.

rout2 <- replicate(n = 1000, expr = {
  x <- rnorm(n, 8 , 8)
  ks.test(x, "pnorm", 8, 8)$p.value
})
hist(rout2)

qqplot(x = ppoints(n), y = rout2, main = "Uniform QQ-plot")
qqline(rout2, distribution = qunif)

Obviously it’s difficult to know which parameters to supply to the KS test. Above we knew to supply 8 as the mean and standard deviation because that’s what we used to generate the data. But what to do in real life? Zeimbekakis, et al. propose a parametric bootstrap to approximate the null distribution of the KS test statistic. The steps to implement the bootstrap are as follows:

  1. draw a random sample from the fitted distribution
  2. get estimates of parameters of random sample
  3. obtain the empirical distribution function
  4. calculate the bootstrapped KS statistic
  5. repeat steps 1 – 4 many times

Let’s do it. The following code is a simplified version of what the authors provide with the paper. Notice they use MASS::fitdistr() to obtain MLE parameter estimates. This returns the same mean for the normal distribution but a slightly smaller (i.e. biased) estimated standard deviation.

param  <- MASS::fitdistr(x, "normal")$estimate
ks <- ks.test(x, function(x)pnorm(x, param[1], param[2]))
stat <- ks$statistic
B <- 1000
stat.b <- double(B)
n <- length(x)

## bootstrapping
for (i in 1:B) {
  # (1) draw a random sample from a fitted dist
  x.b <- rnorm(n, param[1], param[2])
  # (2) get estimates of parameters of random sample
  fitted.b <- MASS::fitdistr(x.b, "normal")$estimate
  # (3) get empirical distribution function
  Fn <- function(x)pnorm(x, fitted.b[1], fitted.b[2])
  # (4) calculate bootstrap KS statistic
  stat.b[i] <- ks.test(x.b, Fn)$statistic
}
mean(stat.b >= stat)
## [1] 0.61

The p-value is the proportion of statistics greater than or equal to the observed statistic calculated with estimated parameters.

Let’s turn this into a function and show that it returns uniformly distributed p-values when used with multiple samples. Again this is a simplified version of the R code the authors generously shared with their paper.

ks.boot <- function(x, B = 1000){
  param  <- MASS::fitdistr(x, "normal")$estimate
  ks <- ks.test(x, function(k)pnorm(k, param[1], param[2]))
  stat <- ks$statistic
  stat.b <- double(B)
  n <- length(x)
  for (i in 1:B) {
    x.b <- rnorm(n, param[1], param[2])
    fitted.b <- MASS::fitdistr(x.b, "normal")$estimate
    Fn <- function(x)pnorm(x, fitted.b[1], fitted.b[2])
    stat.b[i] <- ks.test(x.b, Fn)$statistic
  }
  mean(stat.b >= stat)
}

Now replicate the function with many samples. This takes a moment to run. It took my Windows 11 PC with an Intel i7 chip about 100 seconds to run.

rout_boot <- replicate(n = 1000, expr = {
  x <- rnorm(n, 8 , 8)
  ks.boot(x)
})
hist(rout_boot)

qqplot(x = ppoints(n), y = rout_boot, main = "Uniform QQ-plot")
qqline(rout_boot, distribution = qunif)