Someone on Reddit posted Chapter 1 of Howard Wainer’s book, Picturing the Uncertain World. The name of the chapter is The Most Dangerous Equation and its subject is the formula for the standard deviation of the sampling distribution of the mean. This little guy:
Why is it dangerous? The danger comes from not understanding it. Wainer gives several examples of how ignorance of this equation has “led to billions of dollars of loss over centuries, yielding untold hardship.” But before diving into the examples he briefly describes the formula. It’s not a proof but rather a heuristic demonstration of how it works:
…if we measure, for example, the heights of, say, 1000 students at a particular high school, we might find that the average height is 67 inches, but heights might range from perhaps as little as 55 inches to as much as 80 inches. A number that characterizes this variation is the standard deviation….But now suppose we randomly grouped the 1000 children into 10 groups of 100 and calculated the average within each group. The variation of these 10 averages would likely be much smaller than [the standard deviation of all 1000 students]…
This got me to thinking about how I could demonstrate this in R. It seems like we could generate a population, then take repeated samples of size n to create a sampling distribution, and then show that the standard deviation of the sampling distribution is indeed equal to the population standard deviation divided by n. Let’s do it.
First I load the gtools package which has a function called permutations(). As you might guess, it takes a vector of values and generates all permutations of a given size. After that I generate a “population” of 10 values using the rnorm() function. So my 10 values come from a normal distribution with mean 10 and standard deviation 2, but I’m treating these 10 values as if they’re my entire population for this toy example.
library(gtools) # generate population of size 10 pop <- rnorm(10,10,2)
Now we're ready to generate the sampling distribution. I decided to let n = 3. This means we need to generate every possible sample of size 3. This is where the permutations() function comes. The first argument tells it how big the source vector is, the second states the size of the target vectors, and the third is the source vector. The last tells it to allow repeats. This is important. This replicates sampling with replacement, which is necessary to demonstrate this formula using a finite population.
# generate the sampling distribution # first generate all samples of size 3 sdist <- permutations(10,3,pop,repeats=TRUE)
If you look at the variable sdist, you'll see it's a 1000 x 3 matrix. That's 1000 permutations of size 3 from our original population. Next we take the mean of each row (or sample of size 3):
# find the mean of all samples (in rows) sdm <- apply(sdist,1,mean)
The variable "sdm" is our sampling distribution of the mean. We took every possible sample of size 3 from our population and calculated the mean of each sample. Now the first thing to note is that the mean of the sampling distribution is equal to the mean of our population:
mean(pop) == mean(sdm) [1] TRUE
But what I really wanted to verify was the formula , or equivalently :
sum(sdm^2)/1000 - (mean(sdm)^2) == (sum(pop^2)/10 - (mean(pop)^2))/3 [1] TRUE
Aha! It's true! Of course this isn't a proof, but it came out like we expected. Notice I had to manually find the population variance in each case using . That's because the var() and sd() functions in R divide by n-1.