Monthly Archives: March 2013

The standard deviation of the sampling distribution of the mean

Someone on Reddit posted Chapter 1 of Howard Wainer’s book, Picturing the Uncertain World. The name of the chapter is The Most Dangerous Equation and its subject is the formula for the standard deviation of the sampling distribution of the mean. This little guy:

\sigma_{\bar{x}}=\frac{\sigma}{\sqrt{n}}

Why is it dangerous? The danger comes from not understanding it. Wainer gives several examples of how ignorance of this equation has “led to billions of dollars of loss over centuries, yielding untold hardship.” But before diving into the examples he briefly describes the formula. It’s not a proof but rather a heuristic demonstration of how it works:

…if we measure, for example, the heights of, say, 1000 students at a particular high school, we might find that the average height is 67 inches, but heights might range from perhaps as little as 55 inches to as much as 80 inches. A number that characterizes this variation is the standard deviation….But now suppose we randomly grouped the 1000 children into 10 groups of 100 and calculated the average within each group. The variation of these 10 averages would likely be much smaller than [the standard deviation of all 1000 students]…

This got me to thinking about how I could demonstrate this in R. It seems like we could generate a population, then take repeated samples of size n to create a sampling distribution, and then show that the standard deviation of the sampling distribution is indeed equal to the population standard deviation divided by the square root of n. Let’s do it.

First I load the gtools package which has a function called permutations(). As you might guess, it takes a vector of values and generates all permutations of a given size. After that I generate a “population” of 10 values using the rnorm() function. So my 10 values come from a normal distribution with mean 10 and standard deviation 2, but I’m treating these 10 values as if they’re my entire population for this toy example.

library(gtools)
# generate population of size 10
pop <- rnorm(10,10,2)

Now we're ready to generate the sampling distribution. I decided to let n = 3. This means we need to generate every possible sample of size 3. This is where the permutations() function comes in. The first argument tells it how big the source vector is, the second states the size of the target vectors, and the third is the source vector. The last argument, repeats=TRUE (which R partially matches to the full argument name, repeats.allowed), tells it to allow repeats. This is important: allowing repeats replicates sampling with replacement, which makes the draws independent and is necessary to demonstrate this formula using a finite population.

# generate the sampling distribution
# first generate all samples of size 3
sdist <- permutations(10,3,pop,repeats=TRUE)
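The same enumeration can be sketched in base R for readers who don't have gtools installed. This is only an illustration under a few assumptions: the names pop_demo and samples_demo are hypothetical, and the seed is added purely so the toy population is reproducible.

```r
# Base-R sketch: every ordered sample of size 3, drawn with replacement,
# from a toy population of 10 values.
set.seed(1)                        # assumption: seed added for reproducibility
pop_demo <- rnorm(10, 10, 2)       # same recipe as the post: mean 10, sd 2

# expand.grid() crosses its arguments, giving 10 * 10 * 10 = 1000 rows
samples_demo <- as.matrix(expand.grid(pop_demo, pop_demo, pop_demo))

dim(samples_demo)                  # number of rows and columns
```

Each row is one ordered sample, so the row order differs from what permutations() returns, but the set of samples is the same.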

If you look at the variable sdist, you'll see it's a 1000 x 3 matrix. That's 1000 permutations of size 3 from our original population. Next we take the mean of each row (or sample of size 3):

# find the mean of all samples (in rows)
sdm <- apply(sdist,1,mean)

The variable "sdm" is our sampling distribution of the mean. We took every possible sample of size 3 from our population and calculated the mean of each sample. Now the first thing to note is that the mean of the sampling distribution is equal to the mean of our population:

# all.equal() compares within floating-point tolerance; == can fail on rounding error
all.equal(mean(pop), mean(sdm))
[1] TRUE

But what I really wanted to verify was the formula \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}, or equivalently \sigma_{\bar{x}}^{2} = \frac{\sigma^{2}}{n}:

# compare within floating-point tolerance rather than with ==
all.equal(sum(sdm^2)/1000 - mean(sdm)^2, (sum(pop^2)/10 - mean(pop)^2)/3)
[1] TRUE

Aha! It's true! Of course this isn't a proof, but it came out like we expected. Notice I had to manually find the population variance in each case using \sigma^{2} = E(X^{2}) - \mu^{2}. That's because the var() and sd() functions in R divide by n-1.
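The manual variance calculation can be wrapped in a small helper, and the n-1 issue can be checked directly. The sketch below is a self-contained illustration, not the code used above: pop_var() is a hypothetical helper, the seed is arbitrary, and the sampling distribution is rebuilt with expand.grid() so the block runs on its own.

```r
# Hypothetical helper: population variance, dividing by N rather than N - 1
pop_var <- function(v) mean((v - mean(v))^2)

set.seed(42)                       # assumption: arbitrary seed for reproducibility
pop_demo <- rnorm(10, 10, 2)
N <- length(pop_demo)              # population size
n <- 3                             # sample size

# means of all 10^3 ordered samples of size 3, drawn with replacement
means_demo <- rowMeans(as.matrix(expand.grid(pop_demo, pop_demo, pop_demo)))

# var() divides by N - 1, so rescaling by (N - 1) / N recovers pop_var()
check_var <- all.equal(pop_var(pop_demo), var(pop_demo) * (N - 1) / N)

# the formula itself: variance of the sample means equals sigma^2 / n
check_formula <- all.equal(pop_var(means_demo), pop_var(pop_demo) / n)

c(check_var, check_formula)
```

Both comparisons use all.equal() because the two sides are computed by different arithmetic routes and can differ in the last few bits.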

Using Simulation to Compute Confidence Intervals

I’ve been working through Gelman and Hill’s book, Data Analysis Using Regression and Multilevel/Hierarchical Models. I originally wanted to read it and blog about each chapter the way I did Machine Learning for Hackers. But that book had 12 chapters. This one has 25. And the chapters are longer and denser. I don’t think I have the appetite for such an endeavor. However, I’m still working through the book and want to at least blog about certain topics that catch my eye. Today’s post comes from Chapter 2, Concepts and Methods from Basic Probability and Statistics.

In reviewing confidence intervals, the authors mention using simulation to form confidence intervals for statistics that have complicated standard error formulas. It’s one thing to compute a confidence interval for a mean. The standard error is \frac{s}{\sqrt{n}}. But what about a ratio? Well, that’s more difficult. Thus the book gives a nice example of how to use simulation to find a confidence interval for a ratio (p. 20). Instead of reproducing it verbatim, I thought I would apply the strategy to finding a confidence interval for a median.

My old stats textbook, Probability and Statistical Inference by Hogg and Tanis (7th edition) has a problem in section 6.10 (#7) that gives you measurements of one of the front legs of 20 different spiders and asks you to find a 95% confidence interval for the median. Now this chapter presents a method for calculating this, which is covered in depth at this very nice Penn State web site. Since the problem is odd-numbered, I can flip to the back and see the answer is (15.40, 17.05). Let’s try finding a 95% confidence interval for this data using simulation.

First off, here is the data:

x <- scan("Exercise_6_10-07.txt",skip=1)
sort(x)
 [1] 13.55 13.60 14.05 15.10 15.25 15.40 15.45 15.75 16.25 16.40 16.45 16.65
[13] 16.80 16.95 17.05 17.55 17.75 19.05 19.05 20.00

The measurements are in millimeters. The median of this data is 16.425, easily calculated in R with the median() function. Now what about a confidence interval? This is after all a sample of 20 spiders. Our median is but a point estimate of the population median, a value we will never be able to directly measure. Using R we can simulate a large number of samples by re-sampling from the data and taking the median each time, like this:

nsims <- 1000 # number of simulations
m <- rep(NA,nsims) # empty vector to store medians
for (i in 1:nsims) {
  # resample the data with replacement (same size as x) and store the median
  m[i] <- median(sample(x, replace=TRUE))
}

When we're done, we can use the quantile() function as follows to find the 2.5 and 97.5 percentiles and thus estimate a confidence interval:

quantile(m,c(0.025,0.975))
    2.5%    97.5% 
15.42500 17.00125 

That's very close to the given answer of (15.40, 17.05). Because the resampling is random, the endpoints will differ slightly from run to run. Technically this is a bootstrap (the percentile method), but I think it captures the spirit of using simulation to find a confidence interval.
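For comparison, the printed answer can be reproduced without simulation. The sketch below assumes the textbook method is the usual distribution-free interval built from order statistics, which is not confirmed here against the text. The indices 6 and 15 are likewise an assumption, chosen because they reproduce the printed endpoints.

```r
# Spider leg lengths (mm), copied from the sorted output shown earlier
x <- c(13.55, 13.60, 14.05, 15.10, 15.25, 15.40, 15.45, 15.75, 16.25, 16.40,
       16.45, 16.65, 16.80, 16.95, 17.05, 17.55, 17.75, 19.05, 19.05, 20.00)
y <- sort(x)
n <- length(y)

# The count W of observations below the true median is Binomial(n, 0.5).
# The interval (y[i], y[j]) covers the median when i <= W <= j - 1.
i <- 6    # assumed lower order statistic
j <- 15   # assumed upper order statistic
coverage <- pbinom(j - 1, n, 0.5) - pbinom(i - 1, n, 0.5)

c(lower = y[i], upper = y[j])      # endpoints of the interval
coverage                           # exact confidence level, about 0.959
```

The sixth and fifteenth order statistics give (15.40, 17.05) with an exact confidence level just under 96%, which would explain why the bootstrap percentiles land so close to the book's answer.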