# Parametric Bootstrap of Kolmogorov–Smirnov Test

Zeimbekakis, et al. recently published an article in The American Statistician titled “On Misuses of the Kolmogorov–Smirnov Test for One-Sample Goodness-of-Fit.” One of the misuses they discuss is using the KS test with parameters estimated from the sample. For example, let’s sample some data from a normal distribution.

x <- rnorm(200, mean = 8, sd = 8)
c(xbar = mean(x), s = sd(x))
##     xbar        s
## 8.333385 7.979586

If we wanted to assess the goodness-of-fit of this sample to a normal distribution, the following is a bad way to use the KS test:

ks.test(x, "pnorm", mean(x), sd(x))
##
##  Asymptotic one-sample Kolmogorov-Smirnov test
##
## data:  x
## D = 0.040561, p-value = 0.8972
## alternative hypothesis: two-sided

The appropriate way to use the KS test is to actually supply hypothesized parameters. For example:

ks.test(x, "pnorm", 8, 8)
##
##  Asymptotic one-sample Kolmogorov-Smirnov test
##
## data:  x
## D = 0.034639, p-value = 0.9701
## alternative hypothesis: two-sided

Both tests lead to the same conclusion: we fail to reject the null hypothesis that the sample is from a Normal distribution with the stated mean and standard deviation. However, the former test is very conservative. Zeimbekakis, et al. show this via simulation. I show a simplified version of this simulation. The basic idea is that if the test were valid, the p-values would be uniformly distributed and the points in the uniform distribution QQ-plot would fall along a diagonal line. The plots produced by the code below show that’s clearly not the case.

n <- 200
rout <- replicate(n = 1000, expr = {
  x <- rnorm(n, 8, 8)
  xbar <- mean(x)
  s <- sd(x)
  ks.test(x, "pnorm", xbar, s)$p.value
})
hist(rout, main = "Histogram of p-values")

qqplot(x = ppoints(n), y = rout, main = "Uniform QQ-plot")
qqline(rout, distribution = qunif)

Conclusion: using fitted parameters in place of the true parameters in the KS test yields conservative results. The authors state in the abstract that this “has been ‘discovered’ multiple times.”

When done the right way, the KS test yields uniformly distributed p-values.

rout2 <- replicate(n = 1000, expr = {
  x <- rnorm(n, 8, 8)
  ks.test(x, "pnorm", 8, 8)$p.value
})
hist(rout2)

qqplot(x = ppoints(n), y = rout2, main = "Uniform QQ-plot")
qqline(rout2, distribution = qunif)
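
The same comparison can be summarized with a single number per approach: the share of p-values that fall below 0.05. The following is a minimal sketch, not the authors' code. It reruns both simulations with an arbitrary seed (my assumption, only for reproducibility) and uses the names `p_fitted` and `p_true`, which I made up for illustration. If a test is well calibrated, roughly 5% of its p-values should fall below 0.05 when the null hypothesis is true.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
n <- 200
reps <- 1000

# p-values when the parameters are estimated from the same sample
p_fitted <- replicate(reps, {
  x <- rnorm(n, 8, 8)
  ks.test(x, "pnorm", mean(x), sd(x))$p.value
})

# p-values when the hypothesized parameters are supplied
p_true <- replicate(reps, {
  x <- rnorm(n, 8, 8)
  ks.test(x, "pnorm", 8, 8)$p.value
})

# share of p-values below the nominal 5% level for each approach
c(fitted = mean(p_fitted < 0.05), true = mean(p_true < 0.05))
```

A rejection rate far below 5% for the fitted-parameter version is another way of seeing that the test is conservative.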

Obviously it’s difficult to know which parameters to supply to the KS test. Above we knew to supply 8 as the mean and standard deviation because that’s what we used to generate the data. But what to do in real life? Zeimbekakis, et al. propose a parametric bootstrap to approximate the null distribution of the KS test statistic. The steps to implement the bootstrap are as follows:

1. draw a random sample from the fitted distribution
2. get estimates of parameters of random sample
3. obtain the empirical distribution function
4. calculate the bootstrapped KS statistic
5. repeat steps 1 – 4 many times

Let’s do it. The following code is a simplified version of what the authors provide with the paper. Notice they use MASS::fitdistr() to obtain MLE parameter estimates. This returns the same mean for the normal distribution but a slightly smaller (i.e. biased) estimated standard deviation.

param <- MASS::fitdistr(x, "normal")$estimate
ks <- ks.test(x, function(x) pnorm(x, param[1], param[2]))
stat <- ks$statistic
B <- 1000
stat.b <- double(B)
n <- length(x)

## bootstrapping
for (i in 1:B) {
  # (1) draw a random sample from a fitted dist
  x.b <- rnorm(n, param[1], param[2])
  # (2) get estimates of parameters of random sample
  fitted.b <- MASS::fitdistr(x.b, "normal")$estimate
  # (3) fitted CDF, which ks.test() compares with the empirical distribution function of x.b
  Fn <- function(x) pnorm(x, fitted.b[1], fitted.b[2])
  # (4) calculate bootstrap KS statistic
  stat.b[i] <- ks.test(x.b, Fn)$statistic
}
mean(stat.b >= stat)
## [1] 0.61

The p-value is the proportion of statistics greater than or equal to the observed statistic calculated with estimated parameters.
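
As an aside, the earlier remark that MASS::fitdistr() returns a slightly smaller standard deviation than sd() can be checked directly. This is a small sketch with a freshly simulated sample (the seed and the name `sd_mle` are my own choices): the MLE divides the sum of squared deviations by n, while sd() divides by n - 1.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
x <- rnorm(200, mean = 8, sd = 8)
n <- length(x)

mle <- MASS::fitdistr(x, "normal")$estimate

# MLE of the standard deviation computed by hand (divides by n, not n - 1)
sd_mle <- sqrt(mean((x - mean(x))^2))

# the three estimates side by side
c(sd = sd(x), fitdistr = unname(mle["sd"]), by_hand = sd_mle)

# the two are related by the factor sqrt((n - 1) / n)
all.equal(unname(mle["sd"]), sd(x) * sqrt((n - 1) / n))
```

With n = 200 the factor is very close to 1, which is why the two estimates differ only slightly.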

Let’s turn this into a function and show that it returns uniformly distributed p-values when used with multiple samples. Again this is a simplified version of the R code the authors generously shared with their paper.

ks.boot <- function(x, B = 1000){
  param <- MASS::fitdistr(x, "normal")$estimate
  ks <- ks.test(x, function(k) pnorm(k, param[1], param[2]))
  stat <- ks$statistic
  stat.b <- double(B)
  n <- length(x)
  for (i in 1:B) {
    x.b <- rnorm(n, param[1], param[2])
    fitted.b <- MASS::fitdistr(x.b, "normal")$estimate
    Fn <- function(x) pnorm(x, fitted.b[1], fitted.b[2])
    stat.b[i] <- ks.test(x.b, Fn)$statistic
  }
  mean(stat.b >= stat)
}

Now replicate the function with many samples. This takes a moment to run. It took my Windows 11 PC with an Intel i7 chip about 100 seconds to run.

rout_boot <- replicate(n = 1000, expr = {
x <- rnorm(n, 8 , 8)
ks.boot(x)
})
hist(rout_boot)

qqplot(x = ppoints(n), y = rout_boot, main = "Uniform QQ-plot")
qqline(rout_boot, distribution = qunif)
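
For a quicker sanity check, the simulation can be scaled down. The sketch below is my own variation, not the authors' code: the helpers `ks_stat_fitted()` and `ks_boot_fast()` are hypothetical names, and they compute the normal MLEs in closed form instead of calling MASS::fitdistr(). It uses fewer samples and fewer bootstrap draws, so the estimate is noisy, but the share of p-values at or below 0.05 should land somewhere near 5% if the bootstrap is doing its job.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# KS statistic against a normal with parameters fitted by maximum likelihood
ks_stat_fitted <- function(x) {
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))  # MLE of sd (divides by n)
  unname(ks.test(x, "pnorm", m, s)$statistic)
}

# parametric bootstrap p-value, same idea as ks.boot() above
ks_boot_fast <- function(x, B = 200) {
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))
  stat <- ks_stat_fitted(x)
  stat_b <- replicate(B, ks_stat_fitted(rnorm(length(x), m, s)))
  mean(stat_b >= stat)
}

# reduced simulation: 100 samples of size 200, 200 bootstrap draws each
p_boot <- replicate(100, ks_boot_fast(rnorm(200, 8, 8)))

# share of bootstrap p-values at or below the nominal 5% level
mean(p_boot <= 0.05)
```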

# Using a Bootstrap to Estimate Power and Significance Level

I’ve been reading Common Errors in Statistics (and How to Avoid Them) by Phillip Good and James Hardin. It’s a good bathroom/bedtime book. You can pick it up and put it down as you please. Each chapter is self-contained and contains bite-size, easy-to-read sections. I’m really happy with it so far.

Anyway, chapter 3 had a section on computing Power and sample size that inspired me to hop on the computer:

If the data do not come from one of the preceding distributions, then we might use a bootstrap to estimate the power and significance level.

In preliminary trials of a new device, the following test results were observed: 7.0 in 11 out of 12 cases and 3.3 in 1 out of 12 cases. Industry guidelines specified that any population with a mean test result greater than 5 would be acceptable. A worst-case or boundary-value scenario would include one in which the test result was 7.0 3/7th of the time, 3.3 3/7th of the time, and 4.1 1/7th of the time. i.e., $$(7 \times \frac{3}{7}) + (3.3 \times \frac{3}{7}) + (4.1 \times \frac{1}{7}) = 5$$

The statistical procedure required us to reject if the sample mean of the test results were less than 6. To determine the probability of this event for various sample sizes, we took repeated samples with replacement from the two sets of test results.

If you want to try your hand at duplicating these results, simply take the test values in the proportions observed, stick them in a hat, draw out bootstrap samples with replacement several hundred times, compute the sample means, and record the results.

Well of course I want to try my hand at duplicating the results. Who wouldn’t?

The idea here is to bootstrap from two samples: (1) the one they drew in the preliminary trial with mean = 6.69, and (2) the hypothetical worst-case boundary example with mean = 5. We bootstrap from each and calculate the proportion of samples with mean less than 6. The proportion of results with mean less than 6 from the first population (where true mean = 6.69) can serve as a proxy for Type I error or the significance level. This is the proportion of times we make the wrong decision: we conclude the mean is less than 6 when in fact it’s really 6.69. The proportion of results with mean less than 6 from the second population (where true mean = 5) can serve as a proxy for Power. This is the proportion of times we make the correct decision: we conclude the mean is less than 6 when in fact it’s really 5.

In the book they show the following table of results:

We see they have computed the significance level (Type I error) and power for three different sample sizes. Here’s me doing the same thing in R:

# starting sample of test results (mean = 6.69)
el1 <- c(7.0,3.3)
prob1 <- c(11/12, 1/12)

# hypothetical worst-case population (mean = 5)
el2 <- c(7, 3.3, 4.1)
prob2 <- c(3/7, 3/7, 1/7)

n <- 1000
for (j in 3:5){ # loop through sample sizes
m1 <- double(n)
m2 <- double(n)
for (i in 1:n) {
m1[i] <- mean(sample(el1,j, replace=TRUE,prob=prob1)) # test results
m2[i] <- mean(sample(el2,j, replace=TRUE,prob=prob2)) # worst-case
}
print(paste("Type I error for sample size =",j,"is",sum(m1 < 6.0)/n))
print(paste("Power for sample size =",j,"is",sum(m2 < 6.0)/n))
}
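
The proportions in the print() calls above come from summing a logical vector. A tiny illustration with made-up numbers (the vector `m` here is hypothetical, not output from the simulation):

```r
m <- c(5.5, 6.2, 5.9, 7.0)  # made-up sample means, for illustration only
below <- m < 6.0            # logical vector: TRUE where the mean is below 6
below
sum(below)                  # TRUE counts as 1, so this is the number below 6
mean(below)                 # same value as sum(below) / length(m)
```

Since mean() of a logical vector is the proportion of TRUE values, `mean(m1 < 6.0)` would be an equivalent way to write `sum(m1 < 6.0)/n` in the loop.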



To begin I define vectors containing the values and their probability of occurrence. Next I set n = 1000 because I want to do 1000 bootstrap samples. Then I start the first of two for loops. The first is for my sample sizes (3 to 5) and the next is for my bootstrap samples. Each time I begin a new sample size loop I need to create two empty vectors to store the means from each bootstrap sample. I call these m1 and m2. As I loop through my 1000 bootstrap samples, I take the mean of each sample and assign it to the ith element of the m1 and m2 vectors. m1 holds the sample means from the test results and m2 holds the sample means from the worst-case boundary scenario. Finally I print the results using the paste function. Notice how I calculate the proportion. I create a logical vector by calling mx < 6.0. This returns a vector of TRUE and FALSE values, which sum() treats as 1s and 0s. Summing this vector therefore gives the number of times the mean was less than 6. Dividing that by n (1000) gives me the proportion. Here are my results:

[1] "Type I error for sample size = 3 is 0.244"
[1] "Power for sample size = 3 is 0.845"
[1] "Type I error for sample size = 4 is 0.04"
[1] "Power for sample size = 4 is 0.793"
[1] "Type I error for sample size = 5 is 0.067"
[1] "Power for sample size = 5 is 0.886"
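
Because both populations take only two or three distinct values and the sample sizes are tiny, the probabilities being estimated can also be computed exactly by listing every possible ordered sample. The sketch below is my own cross-check, not something from the book, and `exact_prob()` is a hypothetical helper name.

```r
el1 <- c(7.0, 3.3)
prob1 <- c(11/12, 1/12)
el2 <- c(7, 3.3, 4.1)
prob2 <- c(3/7, 3/7, 1/7)

# exact P(sample mean < cutoff) by enumerating every ordered sample of a given size
exact_prob <- function(values, probs, size, cutoff = 6) {
  idx <- as.matrix(expand.grid(rep(list(seq_along(values)), size)))
  vals <- matrix(values[as.vector(idx)], nrow = nrow(idx))  # one sample per row
  prbs <- matrix(probs[as.vector(idx)], nrow = nrow(idx))   # matching probabilities
  sum(apply(prbs, 1, prod)[rowMeans(vals) < cutoff])
}

for (j in 3:5) {
  cat("size", j,
      "type I:", round(exact_prob(el1, prob1, j), 3),
      "power:", round(exact_prob(el2, prob2, j), 3), "\n")
}
```

For a sample of size 3 from the observed results, the mean drops below 6 as soon as a single 3.3 is drawn, so the exact value is 1 - (11/12)^3, or about 0.23. That is in line with the 0.244 from the bootstrap run above.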


Pretty much the same thing! I guess I could have used the boot function in the boot package to do this. That’s probably more efficient. But this was a clear and easy way to duplicate their results.
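
For the curious, here is a rough sketch of how the boot package might be used for one of these cases (the observed test results with sample size 4). This is one possible setup, not necessarily the most natural one: because the bootstrap sample size differs from the length of the data vector, it uses `sim = "parametric"` with a custom `ran.gen` function. The helper name `draw_sample` and the contents of the `mle` list are my own choices.

```r
library(boot)  # recommended package that ships with most R installations

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
el1 <- c(7.0, 3.3)
prob1 <- c(11/12, 1/12)

# ran.gen receives the data and the `mle` argument and returns one simulated sample
draw_sample <- function(data, mle) {
  sample(data, size = mle$size, replace = TRUE, prob = mle$prob)
}

b <- boot(data = el1, statistic = function(d) mean(d), R = 1000,
          sim = "parametric", ran.gen = draw_sample,
          mle = list(size = 4, prob = prob1))

# proportion of simulated means below 6 (the Type I error proxy for size 4)
mean(b$t < 6)
```

The simulated statistics are stored in `b$t`, so the proportion is computed the same way as in the hand-rolled loop.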