# Parametric Bootstrap of Kolmogorov–Smirnov Test

Zeimbekakis, et al. recently published an article in The American Statistician titled On Misuses of the Kolmogorov–Smirnov Test for One-Sample Goodness-of-Fit. One of the misues they discuss is using the KS test with parameters estimated from the sample. For example, let’s sample some data from a normal distribution.

x <- rnorm(200, mean = 8, sd = 8)
c(xbar = mean(x), s = sd(x))
##     xbar        s
## 8.333385 7.979586

If we wanted to assess the goodness-of-fit of this sample to a normal distribution, the following is a bad way to use the KS test:

ks.test(x, "pnorm", mean(x), sd(x))
##
##  Asymptotic one-sample Kolmogorov-Smirnov test
##
## data:  x
## D = 0.040561, p-value = 0.8972
## alternative hypothesis: two-sided

The appropriate way to use the KS test is to actually supply hypothesized parameters. For example:

ks.test(x, "pnorm", 8, 8)
##
##  Asymptotic one-sample Kolmogorov-Smirnov test
##
## data:  x
## D = 0.034639, p-value = 0.9701
## alternative hypothesis: two-sided

The results of both tests are the same. We fail to reject the null hypothesis that the sample is from a Normal distribution with the stated mean and standard deviation. However, the former test is very conservative. Zeimbekakis, et al. show this via simulation. I show a simplified version of this simulation. The basic idea is that if the test were valid, the p-values would be uniformly distributed and the points in the uniform distribution QQ-plot would fall along a diagonal line. Clearly that’s not the case.

n <- 200
rout <- replicate(n = 1000, expr = {
x <- rnorm(n, 8 , 8)
xbar <- mean(x)
s <- sd(x)
ks.test(x, "pnorm", xbar, s)$p.value }) hist(rout, main = "Histogram of p-values") qqplot(x = ppoints(n), y = rout, main = "Uniform QQ-plot") qqline(rout, distribution = qunif) Conclusion: using fitted parameters in place of the true parameters in the KS test yields conservative results. The authors state in the abstract that this “has been ‘discovered’ multiple times.” When done the right way, the KS test yields uniformly distributed p-values. rout2 <- replicate(n = 1000, expr = { x <- rnorm(n, 8 , 8) ks.test(x, "pnorm", 8, 8)$p.value
})
hist(rout2)

qqplot(x = ppoints(n), y = rout2, main = "Uniform QQ-plot")
qqline(rout2, distribution = qunif)

Obviously it’s difficult to know which parameters to supply to the KS test. Above we knew to supply 8 as the mean and standard deviation because that’s what we used to generate the data. But what to do in real life? Zeimbekakis, et al. propose a parametric bootstrap to approximate the null distribution of the KS test statistic. The steps to implement the bootstrap are as follows:

1. draw a random sample from the fitted distribution
2. get estimates of parameters of random sample
3. obtain the empirical distribution function
4. calculate the bootstrapped KS statistic
5. repeat steps 1 – 4 many times

Let’s do it. The following code is a simplified version of what the authors provide with the paper. Notice they use MASS::fitdistr() to obtain MLE parameter estimates. This returns the same mean for the normal distribution but a slightly smaller (i.e. biased) estimated standard deviation.

param  <- MASS::fitdistr(x, "normal")$estimate ks <- ks.test(x, function(x)pnorm(x, param[1], param[2])) stat <- ks$statistic
B <- 1000
stat.b <- double(B)
n <- length(x)

## bootstrapping
for (i in 1:B) {
# (1) draw a random sample from a fitted dist
x.b <- rnorm(n, param[1], param[2])
# (2) get estimates of parameters of random sample
fitted.b <- MASS::fitdistr(x.b, "normal")$estimate # (3) get empirical distribution function Fn <- function(x)pnorm(x, fitted.b[1], fitted.b[2]) # (4) calculate bootstrap KS statistic stat.b[i] <- ks.test(x.b, Fn)$statistic
}
mean(stat.b >= stat)
## [1] 0.61

The p-value is the proportion of statistics greater than or equal to the observed statistic calculated with estimated parameters.

Let’s turn this into a function and show that it returns uniformly distributed p-values when used with multiple samples. Again this is a simplified version of the R code the authors generously shared with their paper.

ks.boot <- function(x, B = 1000){
param  <- MASS::fitdistr(x, "normal")$estimate ks <- ks.test(x, function(k)pnorm(k, param[1], param[2])) stat <- ks$statistic
stat.b <- double(B)
n <- length(x)
for (i in 1:B) {
x.b <- rnorm(n, param[1], param[2])
fitted.b <- MASS::fitdistr(x.b, "normal")$estimate Fn <- function(x)pnorm(x, fitted.b[1], fitted.b[2]) stat.b[i] <- ks.test(x.b, Fn)$statistic
}
mean(stat.b >= stat)
}

Now replicate the function with many samples. This takes a moment to run. It took my Windows 11 PC with an Intel i7 chip about 100 seconds to run.

rout_boot <- replicate(n = 1000, expr = {
x <- rnorm(n, 8 , 8)
ks.boot(x)
})
hist(rout_boot)

qqplot(x = ppoints(n), y = rout_boot, main = "Uniform QQ-plot")
qqline(rout_boot, distribution = qunif)

# Most useful tests for an ANCOVA model

In his book Regression Modeling Strategies, 2nd Ed, Frank Harrell provides a list of what he calls the “most useful tests” for a 2-level factor $$\times$$ numeric model (Table 2.2, p. 19). This is often called an Analysis of Covariance, or ANCOVA. The basic idea is we have a numeric response with substantial variability and we seek to understand the variability by modeling the mean of the response as a function of a categorical variable and a numeric variable.

Let’s simulate some data for such a model and then see how we can use R to carry out these tests.

n <- 400
set.seed(1)
sex <- factor(sample(x = c("f", "m"), size = n, replace = TRUE))
age <- round(runif(n = n, min = 18, max = 65))
y <- 1 + 0.8*age + 0.4*(sex == "m") - 0.7*age*(sex == "m") + rnorm(n, mean = 0, sd = 8)
dat <- data.frame(y, age, sex)


The data contain a numeric response, y, that is a function of age and sex. I set the “true” coefficient values to 1, 0.8, 0.4, and -0.7. They correspond to $$\beta_0$$ through $$\beta_3$$ in the following model:

$y = \beta_0 + \beta_1 age + \beta_2 sex + \beta_3 age \times sex$

In addition the error component is a Normal distribution with a standard deviation of 8.

Now let’s model the data and see how close we get to recovering the true parameter values.

mod <- lm(y ~ age * sex, dat)
summary(mod)

##
## Call:
## lm(formula = y ~ age * sex, data = dat)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -23.8986  -5.8552  -0.2503   6.0507  30.6188
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.27268    1.93776   0.141    0.888
## age          0.79781    0.04316  18.484   <2e-16 ***
## sexm         2.07143    2.84931   0.727    0.468
## age:sexm    -0.72702    0.06462 -11.251   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.661 on 396 degrees of freedom
## Multiple R-squared:  0.7874, Adjusted R-squared:  0.7858
## F-statistic:   489 on 3 and 396 DF,  p-value: < 2.2e-16


While the coefficient estimates for age and the age $$\times$$ sex interaction are pretty close to the true values, the same cannot be said for the intercept and sex coefficients. The residual standard error of 8.661 is close to the true value of 8.

We can see in the summary output of the model that four hypothesis tests, one for each coefficient, are carried out for us. Each are testing if the coefficient is equal to 0. Of those four, only one qualifies as one of the most useful tests: the last one for age:sexm. This tests if the effect of age is independent of sex and vice versa. Stated two other ways, it tests if age and sex are additive, or if the age effect is the same for both sexes. To get a better understanding of what we’re testing, let’s plot the data with fitted age slopes for each sex.

library(ggplot2)
ggplot(dat, aes(x = age, y = y, color = sex)) +
geom_point() +
geom_smooth(method="lm")


Visually it appears the effect of age is not independent of sex. It seems more pronounced for females. Is this effect real or maybe due to chance? The hypothesis test in the summary output for age:sexm evaluates this. Obviously the effect seems very real. We are not likely to see such a difference in slopes this large if there truly was no difference. It does appear the effect of age is different for each sex. The estimate of -0.72 estimates the difference in slopes (or age effect) for the males and females.

The other three hypothesis tests are not very useful.

• Testing if the Intercept is 0 is testing whether y is 0 for females at age 0.
• Testing if age is 0 is testing whether age is associated with y for males.
• Testing if sexm is 0 is testing whether sex is associated with y for subjects at age 0.

Other more useful tests, as Harrell outlines in Table 2.2, are as follows:

• Is age associated with y?
• Is sex associated with y?
• Are either age or sex associated with y?

The last one is answered in the model output. That’s the F-statistic in the last line. It tests whether all coefficients (except the intercept) are equal to 0. The result of this test is conclusive. At least one of the coeffcients is not 0.

To test if age is associated with y, we need to test if both the age and age:sexm coefficents are equal to 0. The car package by John Fox provides a nice function for this purpose called linearHypothesis. It takes at least two arguments. The first is the fitted model object and the second is a vector of hypothesis tests. Below we specify we want to test if “age = 0” and “age:sexm = 0”

library(car)
linearHypothesis(mod, c("age = 0", "age:sexm = 0"))

## Linear hypothesis test
##
## Hypothesis:
## age = 0
## age:sexm = 0
##
## Model 1: restricted model
## Model 2: y ~ age * sex
##
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)
## 1    398 55494
## 2    396 29704  2     25790 171.91 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


The result is once again conclusive. The p-value is virtually 0. It does indeed appear that age is associated with y.

Likewise, to test if sex is associated with y, we need to test if both the sex and age:sexm coefficents are equal to 0.

linearHypothesis(mod, c("sexm = 0", "age:sexm = 0"))

## Linear hypothesis test
##
## Hypothesis:
## sexm = 0
## age:sexm = 0
##
## Model 1: restricted model
## Model 2: y ~ age * sex
##
##   Res.Df    RSS Df Sum of Sq     F    Pr(>F)
## 1    398 119354
## 2    396  29704  2     89651 597.6 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


As expected this test confirms that sex is associated with y, just as we specified when we simulated the data.

Now that we have established that age is associated with y, and that the association differs for each sex, what exactly is that association for each sex? In other words what are the slopes of the lines in our plot above?

We can sort of answer that with the model coefficients.

round(coef(mod),3)

## (Intercept)         age        sexm    age:sexm
##       0.273       0.798       2.071      -0.727


That corresponds to the following model:

$y = 0.273 + 0.799 age + 2.071 sex – 0.727 age \times sex$

When sex is female, the fitted model is

$y = 0.273 + 0.799 age$

This says the slope of the age is about 0.8 when sex is female.

When sex is male, the fitted model is

$y = (0.273 + 2.071) + (0.797 – 0.727) age$
$y = 2.344 + 0.07 age$

This says the slope of the age is about 0.07 when sex is male.

How certain are we about these estimates? That’s what standard error is for. For the age slope estimate for females the standard error is provided in the model output for the age coefficient. It shows about 0.04. Adding and subtracting 2 $$\times$$ 0.04 to the coefficient gives us a rough 95% confidence interval. Or we could just use the confint function:

confint(mod, parm = "age")

##         2.5 %    97.5 %
## age 0.7129564 0.8826672


The standard error of the age slope estimate for males takes a little more work. Another car function useful for this is the deltaMethod function. It takes at least three arguments: the model object, the quantity expressed as a character phrase that we wish to estimate a standard error for, and the names of the parameters. The function then calculates the standard error using the delta method. Here’s one way to do it for our model

deltaMethod(mod, "b1 + b3", parameterNames = paste0("b", 0:3))

##           Estimate         SE       2.5 %    97.5 %
## b1 + b3 0.07079277 0.04808754 -0.02345709 0.1650426


The standard error is similar in magnitude, but since our estimate is so small the resulting confidence interval overlaps 0. This tells us the effect of age on males is too small for our data to determine if the effect is positive or negative.

Another way to get the estimated age slopes for each sex, along with standard errors and confidence intervals, is to use the margins package. We use the margins function with our model object and specify that we want to estimate the marginal effect of age at each level of sex. (“marginal effect of age” is another way of saying the effect of age at each level of sex)

library(margins)
margins(mod, variables = "age", at = list(sex = c("f", "m")))

## Average marginal effects at specified values

## lm(formula = y ~ age * sex, data = dat)

##  at(sex)     age
##        f 0.79781
##        m 0.07079


This does the formula work we did above. It plugs in sex and returns the estmimated slope coefficient for age. If we wrap the call in summary we get the standard errors and confidence intervals.

summary(margins(mod, variables = "age", at = list(sex = c("f", "m"))))

##  factor    sex    AME     SE       z      p   lower  upper
##     age 1.0000 0.7978 0.0432 18.4841 0.0000  0.7132 0.8824
##     age 2.0000 0.0708 0.0481  1.4722 0.1410 -0.0235 0.1650


# Revisiting some old R-code

One of my first blog posts on this site simulated rolling a die 6 times and observing if side i was observed on the ith roll at least once. (For example, rolling a 3 on my 3rd roll would be a success.) I got the idea from one of my statistics textbooks which had a nice picture of the simulation converging to the true probability of 0.665 over the course of 500 trials. I was able to recreate the simulation in R and I guess I got excited and blogged about it. I was reminded of this today when I opened that old textbook and came across the R code that I had actually written in the margin by hand. Apparently I was super proud of my ground-breaking R code and decided to preserve it for posterity in writing. smh

Over 5 years later my precious R code looks pretty juvenile. It’s way more complicated than it needs to be and doesn’t take advantage of R's vectorized calculations. As Venables and Ripley say in MASS, “Users coming to S from other languages are often slow to take advantage of the power of S to do vectorized calculations…This often leads to unnecessary loops.” Indeed. I'm living proof of that.

I shouldn’t be too hard on my myself though. I was not yet familiar with functions like replicate and cumsum. And I was more focused on recreating the plot than writing optimal R code. I went with what I knew. And R, so forgiving and flexible, accommodated my novice enthusiasm.

Here is how I would approach the problem today:

r.out <- replicate(n = 500, any(sample(1:6, size = 6, replace = T) == 1:6))
p.out <- cumsum(r.out)/seq(500)
plot(x = seq(500), y = p.out, type = "l", ylim = c(0,1),
main = "Convergence to probability as n increases", xlab = "n")
abline(h = 0.665)


On line 1, we use the sample function to “roll a die 6 times” by sampling the numbers 1 – 6, with replacement, 6 times. Then we compare the 6 results with the vector of numbers 1 – 6 using the == operator and use the any function to check if any are TRUE. Next we replicate that 500 times and store the result in r.out. This is a vector of TRUE/FALSE values which R treats numerically as 1 and 0. This means we can use cumsum to find the cumulative sum of successes. To determine the cumulative proportion of successes, we divide each cumulative sum by the trial number. The result is a vector of proportions that should start converging to 0.665. Finally we plot using base R plot and abline.

This is more efficient than my original attempt 5 years ago and better captures the spirit of the simulation. I'm sure 5 years from now if I stumble upon this post I'll have yet another more elegant way to do it. I'm already looking at it thinking, “I should have generalized this with a function, and used ggplot2 to make the graph. And I shouldn't do seq(500) twice.” In fact I know I could have avoided the replicate function by using the fact that there's a probability of $$\frac{1}{6}$$ of observing side i on the ith roll of a die. So I could have used a single rbinom call to do the simulation, like so:

r.out2 <- cumsum(rbinom(n = 500, size = 6, prob = 1/6) > 0)
p.out2 <- r.out2/seq(500)
plot(x = seq(500), y = p.out2, type = "l", ylim = c(0,1),
main = "Convergence to probability as n increases", xlab = "n")
abline(h = 0.665)


In this version instead of simulating 6 literal die rolls, we simulate the number of successes in 6 die rolls. We turn each roll of the die into a binomial event: success or failure. The rbinom function allows us to simulate binomial events where size is the number of trials (or rolls in this case) and prob is the probability of success at each trial. So rbinom(n = 1, size = 6, prob = 1/6) would return a number ranging 0 to 6 indicating the number of success. Think of it as flipping 6 coins, each with probability of getting heads as $$\frac{1}{6}$$, and then counting the number of heads we observed. Setting the n argument to 500 replicates it 500 times. After that it's simply a matter of logically checking which outcomes were greater than 0 and using cumsum on the resulting TRUE/FALSE vector.

This version is way faster. I mean way faster. Compare the time it takes it to do each 1,000,000 times:

system.time({
r.out <- replicate(n = 1e6, any(sample(1:6, size = 6, replace = T) == 1:6))
p.out <- cumsum(r.out)/seq(1e6)
})

##    user  system elapsed
##    5.26    0.00    5.26

system.time({
r.out2 <- cumsum(rbinom(n = 1e6, size = 6, prob = (1/6)) > 0)
p.out2 <- r.out2/seq(1e6)
})

##    user  system elapsed
##    0.06    0.00    0.06


It's not even close. Who was the dummy that wrote that first version with replicate?

But does the new faster version reflect the experimental setting better? Not really. Remember, we're demonstrating probability concepts with die rolls in the first chapter of an intro stats textbook. That's probably not the best time to break out rbinom. And the demo was for 500 trials, not 1,000,000. I had to ramp up the trials to see the speed difference. Maybe the “right” R code in this situation is not the fastest version but rather the one that's easier to understand.

# Permutation Tests

Let's talk about permutation tests and why we might want to do them.

First think about the two-sample t-test. The null hypothesis of this test is that both samples come from the same distribution. If we assume both samples come from the same approximately normal distribution, we can use math formulas based on probability theory to calculate a test statistic. We can then calculate the probability of getting such a test statistic (or one greater) under the assumption of both samples coming from the same distribution. For example, let's draw two samples from the same normal distribution and run a t-test.

set.seed(135)
s1 <- round(rnorm(10, 100, 16))
s2 <- round(rnorm(10, 100, 16))
t.test(s1, s2, var.equal = TRUE)

##
##  Two Sample t-test
##
## data:  s1 and s2
## t = 0.8193, df = 18, p-value = 0.4233
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -7.508 17.108
## sample estimates:
## mean of x mean of y
##     102.4      97.6


As expected the p-value is high, telling us the difference in their means is not significant. Another interpretation is that the resulting t-test statistic (0.8193) would be likely if both samples came from the same distribution.

In real life we don't know with certainty the distribution of the population from which our samples are drawn. We also don't know the variance of the distribution or whether the distribution is even symmetric. We also may be stuck with a small sample size. Fortunately the t-test is pretty robust and usually reliable even when its assumptions are wrong. However, if you have your doubts, you can try a permutation test.

In the case our two-sample example above, the permutation test takes all possible combinations of group membership and creates a permutation distribution. In other words, if we assume both samples came from the same population, a data point in group 1 is just as likely to appear in group 2. If we determine all possible permutations, we can calculate our statistic of interest for each permutation and create a distribution. We can then assess where our original statistic falls in this distribution. If it's in the tails then we have evidence that our two samples come from two different populations. Now if your sample sizes are bigger than say, 15, creating the permutations can get computationally expensive very fast. So this is probably not something you want to do unless you're dealing with small samples.

Here's how we can do a permutation test “by hand” in R with our fake data:

d0 <- mean(s1) - mean(s2)
alls <- c(s1, s2)  # combine into one vector
N <- length(alls)
n <- length(s1)
p <- combn(N, n)  # generate all combinations of n chosen from N
dmeans <- numeric(dim(p)[2])  # vector to store means
for (i in 1:dim(p)[2]) {
dmeans[i] <- mean(alls[p[, i]]) - mean(alls[-p[, i]])
}
# plot histogram of all possible difference in means with lines indicating
# our original difference
hist(dmeans)
abline(v = d0, lty = 2)
abline(v = -d0, lty = 2)


The combn function generates a matrix of all possible index combinations of n chosen from N. I say “index” because combn doesn't return the values themselves but rather their index position in the original vector. Thus we can use it to select all possible combinations from our combined vector, which is what we do in the for loop. In this case there are 184756 possible combinations. Once all possible differences in means are calculated we can plot a histogram of the differences and draw some lines to indicate where our original sample falls. Here I drew two lines to indicate the absolute value of the difference, the equivalent of a two-sided test. The probability (or p-value) of getting a difference more extreme than our original sample is pretty high. We can see this because there's a lot of histogram on either side of our lines. We can also calculate this value:

# two-sided p-value
signif((sum(dmeans <= -abs(d0)) + sum(dmeans >= abs(d0)))/dim(p)[2])

## [1] 0.4291


Very close to the p-value returned by the t-test.

Of course there are packages for this sort of thing. The one I know of is perm. Here's how we use its permTS function to replicate our results above:

library(perm)
permTS(s1, s2, alternative = "two.sided", method = "exact.ce",
control = permControl(tsmethod = "abs"))

##
##  Exact Permutation Test (complete enumeration)
##
## data:  s1 and s2
## p-value = 0.4291
## alternative hypothesis: true mean s1 - mean s2 is  0
## sample estimates:
## mean s1 - mean s2
##               4.8


The method= argument tells the function to do “complete enumeration”. The control= argument says how to calculate the p-value (i.e., the same way we did it “by hand”).

So that's a literal two-sample permutation test. As I mentioned, if your samples are large then this approach is not feasible as the number of permutations grows out of control. For example, two groups of size 20 results in 137,846,528,820 combinations. So we usually resort to resampling methods. This is where we repeatedly resample from our combined vector of values to get a large number of combinations. We don't generate all combinations, but enough to give us a good idea. Here's one way to do it:

dmeans <- numeric(2000)  # vector to store means
for (i in 1:2000) {
g <- sample(N, n)
dmeans[i] <- mean(alls[g]) - mean(alls[-g])
}
hist(dmeans)
abline(v = d0, lty = 2)
abline(v = -d0, lty = 2)


signif((sum(dmeans <= -abs(d0)) + sum(dmeans >= abs(d0)))/2000)

## [1] 0.4335


So out of 1.8476 × 105 possible permutations we only calculated 2000 (assuming no repeats) and we still got a p-value pretty close to that of our literal permutation test. Not bad.

We can do the same using the permTS function as follows:

permTS(s1, s2, alternative = "two.sided",
method = "exact.mc", control = permControl(nmc = 2000,
setSEED = FALSE, tsmethod = "abs"))

##
##  Exact Permutation Test Estimated by Monte Carlo
##
## data:  s1 and s2
## p-value = 0.4213
## alternative hypothesis: true mean s1 - mean s2 is  0
## sample estimates:
## mean s1 - mean s2
##               4.8
##
## p-value estimated from 2000 Monte Carlo replications
## 99 percent confidence interval on p-value:
##  0.3925 0.4498


The method “exact.mc” says use Monte Carlo simulation. The nmc= argument specifies number of replications (the default is 999). The setSEED= argument says not to set a seed. I want different random replications each time I run this line of code.

To wrap up, when does it make sense to use permutation tests? When you have something to permute! I quote from the classic text, An Introduction to the Bootstrap:

Permutation methods tend to apply to only a narrow range of problems. However when they apply, as in testing F = G in a two-sample problem, they give gratifyingly exact answers without parametric assumptions.

# Simulating responses from a linear model

Say you fit a model using R's lm() function. What you get in return are coefficients that represent an estimate of the linear function that gave rise to the data. The assumption is the response of the model is normally distributed with a mean equal to the linear function and a standard deviation equal to the standard deviation of the residuals. Using notation we express this as follows for a simple linear model with intercept and slope coefficients:

$Y_{i} \sim N(\beta_{0} + \beta_{1}x_{i},\sigma)$

The implication of this in the world of statistical computing is that we can simulate responses given values of x. R makes this wonderfully easy with its simulate() function. Simply pass it the model object and the number of simulations you want and it returns a matrix of simulated responses. A quick example:

set.seed(1)
# generate data from the linear function y = 3 + 4.2x with random noise
x <- seq(12, 23, length = 100)
y <- 3 + 4.2 * x + rnorm(100, 0, 5)
mod <- lm(y ~ x)
summary(mod)

##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -11.700  -3.029   0.078   2.926  11.487
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    3.900      2.504    1.56     0.12
## x              4.180      0.141   29.70   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.51 on 98 degrees of freedom
## Multiple R-squared:   0.9,   Adjusted R-squared:  0.899
## F-statistic:  882 on 1 and 98 DF,  p-value: <2e-16


Regressing y on x estimates the coefficients to be about 3.9 and 4.18, pretty close to the true values of 3 and 4.2. It also estimates the standard deviation of the error term to be 4.51, not too far from the 5 we specified in rnorm() when we generated the noise. Now let's use our model to simulate responses (i.e., y values).

head(simulate(mod, 5))

##   sim_1 sim_2 sim_3 sim_4 sim_5
## 1 51.26 55.90 58.09 58.91 54.40
## 2 54.71 62.14 49.79 63.08 53.18
## 3 50.87 62.15 63.88 52.26 49.64
## 4 56.16 53.96 53.72 53.69 55.50
## 5 52.96 45.60 63.38 54.04 60.39
## 6 64.35 67.65 63.20 54.68 63.57


The simulate() function returns a data frame. In this case the data frame has 100 rows and 5 columns. (I use the head() function to only show the first six rows.) Each column represents a simulated set of responses from our linear model given our x values. In the first row we have 5 values drawn from a Normal distribution with mean $$3.89 + 4.17x_{1}$$ and a standard deviation of 4.51. In the second row we have a value drawn from a Normal distribution with mean $$3.89 + 4.17x_{2}$$ and a standard deviation of 4.51. And so on. We could do this “manually” as follows:

simResp <- matrix(0, 100, 5)
for (i in 1:5) {
simResp[, i] <- rnorm(5, c(coef(mod)[1] + coef(mod)[2] * x), rep(summary(mod)$sigma, 5)) } simResp[1:6, ]  ## [,1] [,2] [,3] [,4] [,5] ## [1,] 52.52 48.93 54.80 52.14 52.90 ## [2,] 61.30 47.77 56.02 53.17 46.46 ## [3,] 57.37 53.98 53.25 46.90 63.04 ## [4,] 57.90 64.48 49.14 54.33 63.41 ## [5,] 55.30 56.91 67.99 54.80 59.03 ## [6,] 52.52 48.93 54.80 52.14 52.90  But the simulate() function is much faster and easier to use. Now 5 simulations is rather puny. A better number would be 1000. Then we can use our simulated y values to generate 1000 linear models, and thus have 1000 coefficient estimates. We can then use those estimates to create kernel density plots that allow us to visualize the amount of uncertainty in our coefficient estimates. Let's do that. Notice how I convert the output of the simulate() function to a matrix. That allows me to feed the 1000 responses to the lm() function in one fell swoop instead of coding a for loop to iterate through each column. simResp <- data.matrix(simulate(mod, 1000)) modSim <- lm(simResp ~ x)  modSim is the usual linear model object, but with the results of 1000 models instead of 1. Here are the coefficients for the first 5 models: coef(modSim)[1:2, 1:5]  ## sim_1 sim_2 sim_3 sim_4 sim_5 ## (Intercept) -0.5932 3.949 3.861 5.683 5.038 ## x 4.3836 4.166 4.199 4.062 4.113  Let's create a scatterplot of these coefficients. plot(modSim$coefficients[1, ], modSim$coefficients[2, ], xlab = "intercept", ylab = "slope", main = "Scatterplot of coefficients from simulated models")  We see that the variability of the slope estimates is pretty small (3.8 – 4.6), while the intercept estimates vary widely (from below 0 to 10). This jives with our original model (see output above) where the slope coefficient was signficant with a small standard error while the intercept was not significant and had a relatively large standard error. Next we create kernel density plots of our coefficient estimates. In addition I add a vertical line to represent the mean of the 1000 coefficient estimates and add the mean itself to the plot using the text() function. I use the difference in the range of the axes to allow me to automatically determine suitable coordinates. d1 <- density(modSim$coefficients[2, ])
d2 <- density(modSim$coefficients[1, ]) # slope plot(d1, main = "distribution of slope coefficients") abline(v = mean(modSim$coefficients[2, ]))
text(x = mean(modSim$coefficients[2, ]) + diff(range(d1$x)) * 0.05, y = diff(range(d1$y)) * 0.05, labels = round(mean(modSim$coefficients[2, ]), 2))


# intercept
plot(d2, main = "distribution of intercept coefficients")
abline(v = mean(modSim$coefficients[1, ])) text(x = mean(modSim$coefficients[1, ]) + diff(range(d2$x)) * 0.05, y = diff(range(d2$y)) *
0.05, labels = round(mean(modSim$coefficients[1, ]), 2))  If you don't compare the scale of the x-axis in the two plots the coefficients appear to have similar variability. Perhaps a better way to do it is to put both plots in one window and set the limits to the x-axis for both plots to be the range of the estimated intercept coefficients. par(mfrow = c(2, 1)) # slope plot(d1, main = "distribution of slope coefficients", xlim = range(d2$x))
abline(v = mean(modSim$coefficients[2, ])) text(x = mean(modSim$coefficients[2, ]) + diff(range(d2$x)) * 0.05, y = diff(range(d1$y)) *
0.05, labels = round(mean(modSim$coefficients[2, ]), 2)) # intercept plot(d2, main = "distribution of intercept coefficients", xlim = range(d2$x))
abline(v = mean(modSim$coefficients[1, ])) text(x = mean(modSim$coefficients[1, ]) + diff(range(d2$x)) * 0.05, y = diff(range(d2$y)) *
0.05, labels = round(mean(modSim$coefficients[1, ]), 2))  par(mfrow = c(1, 1))  Now we see just how uncertain our intercept estimate is compared to the slope estimate. Finally, why don't we go ahead and plot all 1000 fitted lines over the original data along with our original fitted line and the traditional 95% confidence band we get using our original model. plot(x, y, col = "red", pch = 19) abline(mod) # original fitted line apply(coef(modSim), 2, abline, col = "gray", lty = 3)  # predicted responses from original model modOrig <- predict(mod, interval = "confidence") # add 95% confidence band lines(x, modOrig[, "lwr"], col = "blue") lines(x, modOrig[, "upr"], col = "blue")  The plotting of the 1000 lines essentially provides us a confidence band for the original fitted line. We see the blue lines are slightly closer to the fitted line than the band produced by our simulated models. That's because the blue lines represent the 95% confidence band. In other words, we would expect 95% of our 1000 estimated lines to fit within those blue lines. The gray lines we see outside of the blue lines are those 5% that don't fall within the 95% confidence band. # Getting started with GitHub Gist in WordPress I’ve decided it’s time to use GitHub Gist for posting code snippets in my blog. My original motivation was better syntax highlighting. I was using WP Code Highlight, which isn’t bad. It’s easy to use and does what it promises. But it doesn’t add line numbers and the syntax highlighting doesn’t seem optimal for R. GitHub Gist on the other hand handles R code beautifully. Now highlighted code is nice, but what’s really cool about Gist is that the code is actually stored in a repository. This means I can update the code without having to update my WordPress post. And it’s easier for other people to use should they ever want to use my code. And there are other benefits as well. I guess the only bad thing is that GitHub has my code, which would be bad if they went belly-up. I’m pretty sure that won’t happen though, because I don’t want to think about it. So here’s how you use GitHub Gist with WordPress to host R code. First create an account on GitHub and go to GitHub Gist. Where it says “Name of file”, type the name of your file and add “.r” to the end. Next select R from the language pulldown, though Gist will probably select it for you if you typed a file name ending in “.r”. Now copy and paste your code into the editor and click “create public gist”. You’re done with GitbHub To make this block of code appear in your WordPress post, copy the code in the “Embed this gist” field and paste into your post. (Make sure you’re in the Text editor instead of the Visual editor). It would be really nice if that was the end of it, but it’s not. WordPress strips out the embed code and nothing happens. The workaround is to add a little plug-in called Raw HTML. As the description says, “This plugin adds a set of shortcodes that you can use to “protect” specific parts of your post and prevent WP from messing with them.” Awesome, just what we need. So the final step is to surround the Gist embed code with [raw]…[/raw] shortcode. (A tip of my hat to this site for telling me about Raw HTML. Let’s try this out. Here’s some code that samples 30 values from a N(15,2) distribution, calculates a 95% confidence interval, and plots the interval as a line. It does this 40 times and colors the CI’s that don’t capture 15 red. To make that appear in my post, I added the following: [raw] <script src=”https://gist.github.com/clayford/7842136.js”></script> [/raw] # Using a Bootstrap to Estimate Power and Significance Level I’ve been reading Common Errors in Statistics (and How to Avoid Them) by Phillip Good and James Hardin. It’s a good bathroom/bedtime book. You can pick it up and put it down as you please. Each chapter is self-contained and contains bite-size, easy-to-read sections. I’m really happy with it so far. Anyway, chapter 3 had a section on computing Power and sample size that inspired me to hop on the computer: If the data do not come from one of the preceding distributions, then we might use a bootstrap to estimate the power and signiﬁcance level. In preliminary trials of a new device, the following test results were observed: 7.0 in 11 out of 12 cases and 3.3 in 1 out of 12 cases. Industry guidelines speciﬁed that any population with a mean test result greater than 5 would be acceptable. A worst-case or boundary-value scenario would include one in which the test result was 7.0 3/7th of the time, 3.3 3/7th of the time, and 4.1 1/7th of the time. i.e., $$(7 \times \frac{3}{7}) + (3.3 \times \frac{3}{7}) + (4.1 \times \frac{1}{7}) = 5$$ The statistical procedure required us to reject if the sample mean of the test results were less than 6. To determine the probability of this event for various sample sizes, we took repeated samples with replacement from the two sets of test results. If you want to try your hand at duplicating these results, simply take the test values in the proportions observed, stick them in a hat, draw out bootstrap samples with replacement several hundred times, compute the sample means, and record the results. Well of course I want to try my hand at duplicating the results. Who wouldn’t? The idea here is to bootstrap from two samples: (1) the one they drew in the preliminary trial with mean = 6.69, and (2) the hypothetical worst-case boundary example with mean = 5. We bootstrap from each and calculate the proportion of samples with mean less than 6. The proportion of results with mean less than 6 from the first population (where true mean = 6.69) can serve as a proxy for Type I error or the significance level. This is proportion of times we make the wrong decision. We conclude the mean is less than 6 when in fact it’s really 6.69. The proportion of results with mean less than 6 from the second population (where true mean = 5) can serve as a proxy for Power. This is proportion of times we make the correct decision. We conclude the mean is less than 6 when in fact it’s really 5. In the book they show the following table of results: We see they have computed the significance level (Type I error) and power for three different sample sizes. Here’s me doing the same thing in R: # starting sample of test results (mean = 6.69) el1 <- c(7.0,3.3) prob1 <- c(11/12, 1/12) # hypothetical worst-case population (mean = 5) el2 <- c(7, 3.3, 4.1) prob2 <- c(3/7, 3/7, 1/7) n <- 1000 for (j in 3:5){ # loop through sample sizes m1 <- double(n) m2 <- double(n) for (i in 1:n) { m1[i] <- mean(sample(el1,j, replace=TRUE,prob=prob1)) # test results m2[i] <- mean(sample(el2,j, replace=TRUE,prob=prob2)) # worst-case } print(paste("Type I error for sample size =",j,"is",sum(m1 < 6.0)/n)) print(paste("Power for sample size =",j,"is",sum(m2 < 6.0)/n)) }  To begin I define vectors containing the values and their probability of occurrence. Next I set n = 1000 because I want to do 1000 bootstrap samples. Then I start the first of two for loops. The first is for my sample sizes (3 - 5) and the next is for my bootstrap samples. Each time I begin a new sample size loop I need to create two empty vectors to store the means from each bootstrap sample. I calls these m1 and m2. As I loop through my 1000 bootstrap samples, I take the mean of each sample and assign to the ith element of the m1 and m2 vectors. m1 holds the sample means from the test results and m2 holds the sample means from the worst-case boundary scenario. Finally I print the results using the paste function. Notice how I calculate the proportion. I create a logical vector by calling mx < 6.0. This returns a vector of 0s and 1s, where 0 is false and 1 is true. I then sum this vector to get the number of times the mean was less than 6. Dividing that by n (1000) gives me the proportion. Here are my results: [1] "Type I error for sample size = 3 is 0.244" [1] "Power for sample size = 3 is 0.845" [1] "Type I error for sample size = 4 is 0.04" [1] "Power for sample size = 4 is 0.793" [1] "Type I error for sample size = 5 is 0.067" [1] "Power for sample size = 5 is 0.886"  Pretty much the same thing! I guess I could have used the boot function in the boot package to do this. That’s probably more efficient. But this was a clear and easy way to duplicate their results. # Simulation to Represent Uncertainty in Regression Coefficients Working through Gelman and Hill’s book can be maddening. The exposition is wonderful but recreating their examples leads to new and exciting levels of frustration. Take the height and earnings example in chapter 4. It’s a simple linear regression of earnings on height. You’d think you could download the data from the book web site, use the code in the book, and reproduce the example. They sure lead you to think that. But it doesn’t work that way. For starters, when you load the data it has 2029 records. However the output of the regression in the book shows n = 1192. So subsetting needs to be done. As far as I can tell, the subsetting is not discussed in the book. Now the author has an “earnings” folder on his site under an “examples” directory, which contains a R file called “earnings_setup.R“. (Never mind the book itself doesn’t appear to mention this directory on the web site.) So this seems to be the place where we find out how to subset the data. The key line of code is ok <- !is.na (earn+height+sex+age) & earn>0 & yearbn>25  which creates a logical vector to subset the data. But when you run it and request the dimensions of the data frame you have 1059 records, not 1192! After trial and error I believe the subset to reproduce the results in the book should be ok <- !is.na (earn+height+sex) & earn>0  That gave me 1192. For the record, here’s my full code: heights <- read.dta ("http://www.stat.columbia.edu/~gelman/arm/examples/earnings/heights.dta") attach(heights) male <- 2 - sex # make male 1 (and therefore female = 0) ok <- !is.na (earn+height+sex) & earn>0 heights.clean <- as.data.frame (cbind (earn, height, sex, male)[ok,]) heights.clean$log.earn <- log(heights.clean$earn)  OK, so now(!) we can reproduce their example: earn.logmodel.3 <- lm (log.earn ~ height + male + height:male, data=heights.clean)  The reason I was interested in this example was for another example in chapter 7 on "using simulation to represent uncertainty in regression coefficients" (p. 142). In other words, instead of using the standard errors and intervals obtained from the predict() function in R, we compute uncertainties by simulation. It seems easy enough using the sim() function from the book's arm package. You do something like the following: library(arm) sim.1 <- sim(earn.logmodel.3, 1000)  The sim function takes two arguments: your model and the number of simulations. It returns a vector of simulated residual standard deviations and a matrix of simulated regression coefficients. We can use these to calculate confidence intervals for coefficients that do not have standard errors in the regression output. Their example asks "what can be said about the coefficient of height among men?" If you look up at the model, you can see it contains an interaction term of height and male. To answer the question we can't just look at the standard error of the height coefficient or just the interaction coefficient. There is no simple way to do what they ask using regression output. So the book instructs us to use the output of the sim function to answer this as follows: height.for.men.coef <- sim.1$beta[,2] + sim.1$beta[,4] quantile(height.for.men.coef, c(0.025,0.975))  Except that doesn't work. More frustration. It produces an error that says "Error in sim.1$beta : $operator not defined for this S4 class" (instead of giving us a 95% interval). With some googling and persistence I was able to determine that the following is how it should be done: height.for.men.coef <- sim.1@coef[,2] + sim.1@coef[,4] quantile(height.for.men.coef, c(0.025,0.975)) 2.5% 97.5% -0.0004378039 0.0507464098  Notice that "@coef" replaces "$beta". And with that I was able to finally reproduce the example I was interested in!

Now about this simulation function. While I appreciate functions that save time and make life easy, I do like to know how they work. Fortunately Gelman and Hill provide pseudo-code in the book. It goes like this:

1. Run your regression to compute the vector $\hat{\beta}$ of estimated parameters, the unscaled estimation covariance matrix $V_{\beta}$, and the residual variance $\hat{\sigma^{2}}$
2. Create n random simulations for the coefficient vector $\beta$ and residual standard deviation. For each simulation draw:
1. Simulate $\sigma = \hat{\sigma}\sqrt{(n - k)/X}$ where X is a random draw from the $\chi^{2}$ distribution with n - k degrees of freedom.
2. Given the random draw of $\sigma$, simulate $\beta$ from a multivariate normal distribution with mean $\hat{\beta}$ and variance matrix $\sigma^{2}V_{\beta}$

Not too bad. Let's use this to manually run our own simulations, so we have an idea of how the sim() function works. (Plus you may not want to use the arm package as it requires loading 9 more packages.)

Step 1 is easy enough. That's just running your regression as you normally would. Next we need to extract our estimated parameters, the unscaled covariance matrix and the residual standard deviation. We also need to snag degrees of freedom for our chi-square random draw. Here's how we can get them:

# extract coefficents
earn.logmodel.3$coef # extract residual standard error from model summary(earn.logmodel.3)$sigma

# extract unscaled covariance matrix
summary(earn.logmodel.3)$cov.unscaled # extract k and n - k; first two elements in vector summary(earn.logmodel.3)$df


Let's use this information to do a single simulation.

s.hat <- summary(earn.logmodel.3)$sigma n.minus.k <- summary(earn.logmodel.3)$df[2]
library(MASS) # need for mvrnorm function to simulate draws from multivariate normal dist'n
# simulate residual standard deviation
(s.hat.sim <- s.hat*sqrt(n.minus.k/rchisq(1, n.minus.k)))
[1] 0.8591814
# use simulated residual standard deviation to simulate regression coefficients
mvrnorm(1,earn.logmodel.3$coef, s.hat.sim^2*summary(earn.logmodel.3)$cov.unscaled)
(Intercept)       height         male  height:male
7.906605029  0.025163124 -0.160828921  0.007904422


That seems to work. How about we try doing 1000 simulations. Here's one way:

n <- 1000
sim.2.sigma <- rep(NA,n)
sim.2.coef <- matrix(NA,n,4)
for (i in 1:n){
sim.2.sigma[i] <- s.hat*sqrt(n.minus.k/rchisq(1, n.minus.k))
sim.2.coef[i,] <- mvrnorm(1,earn.logmodel.3$coef,sim.2.sigma[i]^2*summary(earn.logmodel.3)$cov.unscaled)
}


Now let's see how our simulation results compare to what we got using the sim() function:

height.for.men.coef.2 <- sim.2.coef[,2] + sim.2.coef[,4]
quantile(height.for.men.coef.2, c(0.025,0.975))
2.5%        97.5%
-0.001828216  0.049499381


Looks similar to what we got above. Nice. It's probably better to just use the sim() function for this sort of thing, but at least now we know a little more about what it's doing.

# Using Simulation to Compute Confidence Intervals

I’ve been working through Gelman and Hill’s book, Data Analysis Using Regression and Multilevel/Hierarchical Models. I originally wanted to read it and blog about each chapter the way I did Machine Learning for Hackers. But that book had 12 chapters. This one has 25. And the chapters are longer and denser. I don’t think I have the appetite for such an endeavor. However, I’m still working through the book and want to at least blog about certain topics that catch my eye. Today’s post comes from Chapter 2, Concepts and Methods from Basic Probability and Statistics.

In reviewing confidence intervals, the authors mention using simulation to form confidence intervals for statistics that have complicated standard error formulas. It’s one thing to compute a confidence interval for a mean. The standard error is $\frac{s}{\sqrt{n}}$. But what about a ratio? Well, that’s more difficult. Thus the book gives a nice example on how to use simulation to find a confidence interval for a ratio (p. 20). Instead of reproducing it verbatim, I thought I would apply the strategy to finding a confidence interval for a median.

My old stats textbook, Probability and Statistical Inference by Hogg and Tanis (7th edition) has a problem in section 6.10 (#7) that gives you measurements of one of the front legs of 20 different spiders and asks you to find a 95% confidence interval for the median. Now this chapter presents a method for calculating this, which is covered in depth at this very nice Penn State web site. Since the problem is odd-numbered, I can flip to the back and see the answer is (15.40, 17.05). Let’s try finding a 95% confidence interval for this data using simulation.

First off, here is the data:

x <- scan("Exercise_6_10-07.txt",skip=1)
> sort(x)
[1] 13.55 13.60 14.05 15.10 15.25 15.40 15.45 15.75 16.25 16.40 16.45 16.65
[13] 16.80 16.95 17.05 17.55 17.75 19.05 19.05 20.00


The measurements are in millimeters. The median of this data is 16.425, easily calculated in R with the median() function. Now what about a confidence interval? This is after all a sample of 20 spiders. Our median is but a point estimate of the population median, a value we will never be able to directly measure. Using R we can simulate a large number of samples by re-sampling from the data and taking the median each time, like this:

nsims <- 1000 # number of simulations
m <- rep(NA,nsims) # empty vector to store medians
for (i in 1:nsims){
m[i] <- median(sample(x,replace=TRUE))
}


When we're done, we can use the quantile() function as follows to find the 2.5 and 97.5 percentiles and thus estimate a confidence interval:

quantile(m,c(0.025,0.975))
2.5%    97.5%
15.42500 17.00125


That's very close to the given answer of (15.40, 17.05). This example would technically be filed under "bootstrap", but I think it captures the spirit of using simulation to find a confidence interval.

# Buffon’s Needle Problem, or How to use Probability to Estimate Pi

I gave a presentation on Buffon’s Needle Problem in a job interview once. Here’s the presentation I gave in PDF format if you’re interested. If you’ve never heard of Buffon’s Needle Problem, you should open my little presentation and browse through it. It’s one of the damndest things I’ve ever learned.

Here’s the setup. Imagine you have a floor with parallel lines, kind of like hardwood floors, with all lines equally spaced from one another. Let’s say the distance between the lines is L:

Now imagine you drop some needles of length L on the floor and count the instances of where the needles cross the lines:

Yes, I know, I have red needles! Cool, right? Anyway, here’s the crazy thing: if you drop a lot of needles (say 10,000) and count the number of needles crossing lines, you can use that information to estimate Pi! It turns out that if the distance between lines and the needle length are both 1, then $$\pi \approx \frac{2n}{k}$$, where n = number of needles and k = number of needles crossing lines. I don’t know about you but I think that’s nuts!

Let’s fire up R and see this in action. Here’s slightly modified code from the presentation:

a <- 1 # length of needle
L <- 1 # distance between lines
n <- 100000 # number of dropped needles
hit <- 0
for(i in 1:n) {
x <- runif(1,0,1)
y <- runif(1,0,1)
while(x^2 + y^2 > 1) { # no points outside of unit circle
x <- runif(1,0,1)
y <- runif(1,0,1)
}
theta <- atan(y/x) # the random angle
d <- runif(1,0,(L/2)) # distance of needle midpoint to nearest line
if(d <= (a/2)*sin(theta)) {
hit <- hit + 1
}
}
pi.est <- (n*2)/(hit)
pi.est


First I define the distance between the lines (L) and the needle length (a) to both be 1. They don't have to be equal, but the needle length does need to be less than or equal to the distance between the lines. It turns out that in general $$\pi \approx \frac{2na}{kL}$$. In my example, I have $$a = L = 1$$, so it simplifies to $$\pi \approx \frac{2n}{k}$$. Next I define a variable called "hit" to count the number of needles crossing a line and then I dive into a loop to simulate dropping 100,000 needles.

The first 7 lines in the loop generate a random acute angle (less than 90 degrees or $$\frac{\pi}{2}$$ radians) by way of the arctan function and x and y points that lie within the unit circle. The reason the points need to lie within the unit circle is to ensure all angles have an equal chance of being selected. The next line randomly generates a distance (d) from the midpoint of the needle to the nearest line. Using my random angle and random distance I then check to see if my needle crossed the line. If $$d \le \frac{a}{2}sin(\theta)$$ then the needle crossed the line and I increment my hit counter. In my presentation I try to justify this mathematical reasoning using pretty pictures.

Finally I calculate my Pi estimate and spit it out. I ran it just now with n = 100,000 and got 3.136517. Not an accurate estimate but pretty close. When I tried it with n = 1,000,000 I got 3.142337. The more needles, the better your estimate.

Now is this practical? Of course not. I can Google Pi and get it to 11 decimal places. R even has a built-in Pi variable (pi). But the fact we can use probability to estimate Pi just never ceases to amaze me. Nice going Buffon!