I’ve been working through Gelman and Hill’s book, *Data Analysis Using Regression and Multilevel/Hierarchical Models*. I originally wanted to read it and blog about each chapter the way I did *Machine Learning for Hackers*. But that book had 12 chapters. This one has 25. And the chapters are longer and denser. I don’t think I have the appetite for such an endeavor. However, I’m still working through the book and want to at least blog about certain topics that catch my eye. Today’s post comes from Chapter 2, Concepts and Methods from Basic Probability and Statistics.

In reviewing confidence intervals, the authors mention using simulation to form confidence intervals for statistics that have complicated standard error formulas. It’s one thing to compute a confidence interval for a mean. The standard error is . But what about a ratio? Well, that’s more difficult. Thus the book gives a nice example on how to use simulation to find a confidence interval for a ratio (p. 20). Instead of reproducing it verbatim, I thought I would apply the strategy to finding a confidence interval for a median.

My old stats textbook, *Probability and Statistical Inference* by Hogg and Tanis (7th edition) has a problem in section 6.10 (#7) that gives you measurements of one of the front legs of 20 different spiders and asks you to find a 95% confidence interval for the median. Now this chapter presents a method for calculating this, which is covered in depth at this very nice Penn State web site. Since the problem is odd-numbered, I can flip to the back and see the answer is (15.40, 17.05). Let’s try finding a 95% confidence interval for this data using simulation.

First off, here is the data:

x <- scan("Exercise_6_10-07.txt",skip=1) > sort(x) [1] 13.55 13.60 14.05 15.10 15.25 15.40 15.45 15.75 16.25 16.40 16.45 16.65 [13] 16.80 16.95 17.05 17.55 17.75 19.05 19.05 20.00

The measurements are in millimeters. The median of this data is 16.425, easily calculated in R with the median() function. Now what about a confidence interval? This is after all a sample of 20 spiders. Our median is but a point estimate of the population median, a value we will never be able to directly measure. Using R we can simulate a large number of samples by re-sampling from the data and taking the median each time, like this:

nsims <- 1000 # number of simulations m <- rep(NA,nsims) # empty vector to store medians for (i in 1:nsims){ m[i] <- median(sample(x,replace=TRUE)) }

When we're done, we can use the quantile() function as follows to find the 2.5 and 97.5 percentiles and thus estimate a confidence interval:

quantile(m,c(0.025,0.975)) 2.5% 97.5% 15.42500 17.00125

That's very close to the given answer of (15.40, 17.05). This example would technically be filed under "bootstrap", but I think it captures the spirit of using simulation to find a confidence interval.