Hypothesis testing: section 4.6 of Regression and Other Stories

Gelman, et al. tell a story from long ago where someone sent them a fax (that’s right, a fax) asking for help with suspected voter fraud. The story is in section 4.6 (page 63) and is included to provide an example of constructing a hypothesis test. They provide data and code for this example in the Coop folder on Github. The point of this post is to document some changes I made to the code to help me understand it.

The story involves the election of a board of directors for a “residential organization”. 5553 people were allowed to vote for up to 6 people. 27 candidates were running for the board. Votes were tallied after 600 people voted, then again at 1200, 2444, 3444, 4444, and the end after all 5553 people voted. What aroused suspicion was the fact that the proportion of votes for the candidates remained steady each time the votes were tallied. According to the author of the fax: “the election was rigged…[it] is a fixed vote with fixed percentages being assigned to each and every candidate making it impossible to participate in an honest election.”

Let’s read in the data and demonstrate what they’re talking about. Notice this data is the rare CSV without column headers. The data consists of 27 rows, one for each candidate, showing cumulative vote totals.

data <- read.csv("https://raw.githubusercontent.com/avehtari/ROS-Examples/master/Coop/data/Riverbay.csv", 
                 header = FALSE)
# drop 1st and 8th columns; contain candidate names which we don't need.
votes <- data[,2:7]
head(votes)

##    V2  V3  V4   V5   V6   V7
## 1 208 416 867 1259 1610 2020
## 2  55 106 215  313  401  505
## 3 133 250 505  716  902 1129
## 4 101 202 406  589  787  976
## 5 108 249 512  745  970 1192
## 6  54  94 196  279  360  451

Now let’s calculate the proportion of votes received at each interval and create a basic line plot. Each line below represents proportion of votes received for a candidate at each of the six intervals. Notice how the lines are mostly flat. This is what prompted the emergency fax.

vote_p <- apply(votes, 2, proportions)
matplot(t(vote_p), type = "l", col = 1, lty = 1)

Gelman, et al. demonstrate this using separate plots for the top 8 vote-getters (Fig 4.5). They also divide by number of voters instead of total votes received. (Remember, each voter gets to vote for up to six people.) This simply changes the denominator, and hence, the y-axis. The steady vote patterns remain.

voters <- c(600,1200,2444,3444,4444,5553)
vote_p <- sweep(votes, 2, voters, FUN = "/")
matplot(t(vote_p), type = "l", col = 1, lty = 1)

They note that the data in this plot is not independent since proportions at times 2 and beyond include votes that came before. To address this, they create a matrix that contains number of votes received at each interval instead of cumulative totals.

interval_votes <- t(apply(votes, 1, diff))
interval_votes <- cbind(votes[,1], interval_votes)
head(interval_votes)

##           V3  V4  V5  V6  V7
## [1,] 208 208 451 392 351 410
## [2,]  55  51 109  98  88 104
## [3,] 133 117 255 211 186 227
## [4,] 101 101 204 183 198 189
## [5,] 108 141 263 233 225 222
## [6,]  54  40 102  83  81  91

After taking differences the lines still seem mostly stable.

interval_p <- apply(interval_votes, 2, proportions)
matplot(t(interval_p), type = "l", col = 1, lty = 1)

Again, the authors divide by number of voters instead of total votes to create these plots, but the result is the same with a different y-axis. Here’s how I would do the calculations and create the plot.

interval_voters <- c(600, diff(voters))
interval_p <- sweep(interval_votes, 2, interval_voters, FUN = "/")
matplot(t(interval_p), type = "l", col = 1, lty = 1)

And now comes the hypothesis test. What is the probability of seeing steady proportions like this if the votes really were coming in at random? I’ll quote the book here: “Because the concern was that the votes were unexpectedly stable as the count proceeded, we define a test statistic to summarize variability.” The test statistic in this case is the standard deviations of the sample proportions. We can quickly get these from the interval_p object we created above.

test_stat <- apply(interval_p, 1, sd)

Now we need to calculate the theoretical test statistic. For this we assume each candidate has a fixed but unknown proportion of voters who will vote for them, \(\pi_i\). Under the null, the six intervals where votes are tallied are random samples of the voters. So at each time point we can think of the proportion as a draw from a distribution with mean \(\pi_i\) and standard deviation \(\sqrt{\pi_i(1 – \pi_i)/n_t}\), where \(n_t\) is the number of voters at each interval. To calculate this, we first need to estimate \(\pi_i\) with \(p_i\), the observed proportion of votes each candidate received. This is the last column of the votes data frame divided by the total number of voters, 5553.

p_hat <- votes[,6]/5553

Then we take the average of the variances calculated at each time point and take the square root to get the theoretical test statistic.

theory_test_stat <- sapply(p_hat, function(x)sqrt(mean(x*(1-x)/interval_voters)))

Under the null, the observed test statistics should be very close to the theoretical test statistics. This is assessed in Fig 4.7 in the book. I replicate the plot as follows:

plot(x = votes[,6], y = test_stat, xlab = "total # of votes for the candidate",
     ylab = "sd of separate vote proportions")
points(x = votes[,6], y = theory_test_stat, pch = 19)

The authors note that “the actual standard deviations appear consistent with the theoretical model.”

Personally I think the plot would be a little more effective if they zoomed out a little. Some of the dramatic looking departures are only off by 0.01. For example:

plot(x = votes[,6], y = test_stat, xlab = "total # of votes for the candidate",
     ylab = "sd of separate vote proportions", ylim = c(0,0.05))
points(x = votes[,6], y = theory_test_stat, pch = 19)

Another null hypothesis approach is the chi-square test of association. Under the null, the number of votes is not associated with the interval when votes were tallied. We can run this test for each candidate and look at the p-values. If there is no association for each candidate we should see a fairly uniform scatter of p-values. On the other hand, if there was “suspiciously little variation over time” we would see a surplus of high p-values. Here’s how I carried out these calculations. I first created the 2-way tables of yes/no versus time for each candidate. I then applied the chi-square test to each table, and to that result, I extracted each p-value. A uniform QQ plot shows the p-values are mostly uniformly distributed.

tables <- apply(interval_votes, 1, function(x) rbind(x, interval_voters - x), 
      simplify = FALSE)
chisq_out <- lapply(tables, chisq.test, correct = FALSE)
p_values <- sapply(chisq_out, function(x)x$p.value)
qqplot(ppoints(27), p_values)
qqline(p_values, distribution = qunif)

Finally the authors mention that a single test on the entire 27 x 6 table could be performed. This seems like the easiest approach of all.

chisq.test(interval_votes, correct = F)

## 
##  Pearson's Chi-squared test
## 
## data:  interval_votes
## X-squared = 114.72, df = 130, p-value = 0.8279

My R code differs quite a bit from the R code provided by the authors. I’m not saying mine is better, it just makes more sense to me. Maybe someone else will find this approach useful.

Hypothesis testing: section 4.6 of Regression and Other Stories

Clay Ford

Leave a Reply Cancel reply