Category Archives: Power and sample size

Using a Bootstrap to Estimate Power and Significance Level

I’ve been reading Common Errors in Statistics (and How to Avoid Them) by Phillip Good and James Hardin. It’s a good bathroom/bedtime book. You can pick it up and put it down as you please. Each chapter is self-contained and contains bite-size, easy-to-read sections. I’m really happy with it so far.

Anyway, chapter 3 had a section on computing Power and sample size that inspired me to hop on the computer:

If the data do not come from one of the preceding distributions, then we might use a bootstrap to estimate the power and significance level.

In preliminary trials of a new device, the following test results were observed: 7.0 in 11 out of 12 cases and 3.3 in 1 out of 12 cases. Industry guidelines specified that any population with a mean test result greater than 5 would be acceptable. A worst-case or boundary-value scenario would include one in which the test result was 7.0 3/7th of the time, 3.3 3/7th of the time, and 4.1 1/7th of the time. i.e., \((7 \times \frac{3}{7}) + (3.3 \times \frac{3}{7}) + (4.1 \times \frac{1}{7}) = 5\)

The statistical procedure required us to reject if the sample mean of the test results were less than 6. To determine the probability of this event for various sample sizes, we took repeated samples with replacement from the two sets of test results.

If you want to try your hand at duplicating these results, simply take the test values in the proportions observed, stick them in a hat, draw out bootstrap samples with replacement several hundred times, compute the sample means, and record the results.

Well of course I want to try my hand at duplicating the results. Who wouldn’t?

The idea here is to bootstrap from two samples: (1) the one they drew in the preliminary trial with mean = 6.69, and (2) the hypothetical worst-case boundary example with mean = 5. We bootstrap from each and calculate the proportion of samples with mean less than 6. The proportion of results with mean less than 6 from the first population (where true mean = 6.69) can serve as a proxy for Type I error or the significance level. This is proportion of times we make the wrong decision. We conclude the mean is less than 6 when in fact it’s really 6.69. The proportion of results with mean less than 6 from the second population (where true mean = 5) can serve as a proxy for Power. This is proportion of times we make the correct decision. We conclude the mean is less than 6 when in fact it’s really 5.

In the book they show the following table of results:

We see they have computed the significance level (Type I error) and power for three different sample sizes. Here’s me doing the same thing in R:

# starting sample of test results (mean = 6.69)
el1 <- c(7.0,3.3)
prob1 <- c(11/12, 1/12)

# hypothetical worst-case population (mean = 5)
el2 <- c(7, 3.3, 4.1)
prob2 <- c(3/7, 3/7, 1/7)

n <- 1000
for (j in 3:5){ # loop through sample sizes
  m1 <- double(n)
  m2 <- double(n)
    for (i in 1:n) {
      m1[i] <- mean(sample(el1,j, replace=TRUE,prob=prob1)) # test results
      m2[i] <- mean(sample(el2,j, replace=TRUE,prob=prob2)) # worst-case
print(paste("Type I error for sample size =",j,"is",sum(m1 < 6.0)/n)) 
print(paste("Power for sample size =",j,"is",sum(m2 < 6.0)/n))

To begin I define vectors containing the values and their probability of occurrence. Next I set n = 1000 because I want to do 1000 bootstrap samples. Then I start the first of two for loops. The first is for my sample sizes (3 - 5) and the next is for my bootstrap samples. Each time I begin a new sample size loop I need to create two empty vectors to store the means from each bootstrap sample. I calls these m1 and m2. As I loop through my 1000 bootstrap samples, I take the mean of each sample and assign to the ith element of the m1 and m2 vectors. m1 holds the sample means from the test results and m2 holds the sample means from the worst-case boundary scenario. Finally I print the results using the paste function. Notice how I calculate the proportion. I create a logical vector by calling mx < 6.0. This returns a vector of 0s and 1s, where 0 is false and 1 is true. I then sum this vector to get the number of times the mean was less than 6. Dividing that by n (1000) gives me the proportion. Here are my results:

[1] "Type I error for sample size = 3 is 0.244"
[1] "Power for sample size = 3 is 0.845"
[1] "Type I error for sample size = 4 is 0.04"
[1] "Power for sample size = 4 is 0.793"
[1] "Type I error for sample size = 5 is 0.067"
[1] "Power for sample size = 5 is 0.886"

Pretty much the same thing! I guess I could have used the boot function in the boot package to do this. That’s probably more efficient. But this was a clear and easy way to duplicate their results.

The power function

Power is the probability of rejecting a null hypothesis when the alternative is true. Say your null hypothesis is \( \mu \le 55\) and that your alternative hypothesis is \( \mu > 55\). Further, say the true state of the world is such that \( \mu = 60\). The probability we reject \( \mu \le 55\) given that state of the world, that is make the correct decision, is called Power.

Let’s say we carry out such a test but we don’t know the true state of the world (we never do in real life). We sample 20 items and decide to reject the null of \( \mu \le 55\) if \( \overline{x}\ge 60\). We’ll assume we’re dealing with a Normal distribution that has an unknown mean \( \mu\) and standard deviation of \( \sigma=5\). What is the power of such a test? Well, it depends on what \( \mu\) really is. For example, if \( \mu=61\), then the power of the test is

$$ 1 – \Phi\left(\frac{60-61}{5/\sqrt{20}}\right) = 1 – 0.186 = 0.814$$

This says we have about a 81% chance of getting a sample mean of 60 (or higher) if the true mean is 61. That’s pretty good power. Traditionally statisticians like to have at least 80% power when doing experiments. Of course the catch is you have to know the standard deviation. Is that even possible? Not really. What most people do is make the best estimate possible and err on the side of being too conservative. If they think the standard deviation is about 2.5, they’ll round it up to 3 to be safe. Now as your standard deviation increases, your power decreases. So being conservative means you have to increase your sample size to get the power back up to a desirable level. Going back to example, let’s say our standard deviation is 6 instead of 5:

$$ 1 – \Phi\left(\frac{60-61}{6/\sqrt{20}}\right) = 0.772$$

Notice how the power dropped to 77%. To increase it back to around 80% I can increase my sample size. To do so in this example I need to increase my sample size from 20 to 27:

$$ 1 – \Phi\left(\frac{60-61}{6/\sqrt{27}}\right) = 0.807$$

I could also hypothesize a different true mean in order to increase power. Previously I assumed a true mean of 60. If I leave my sample size at 20 and assume the larger standard deviation of 6, I can obtain 80% power by hypothesizing a mean of 61.15:

$$ 1 – \Phi\left(\frac{60-61.15}{6/\sqrt{20}}\right) = 0.804$$

These formulas I’ve been using are power functions and they cry out for a spreadsheet. In one column enter various hypothesized means and then run a power function down an adjacent column to see how the power changes. You can do the same with sample size.

Here’s one trying different means:

Here’s another trying different sample sizes:

So we see that the further away the true mean is from our cut-off point, or the bigger our sample, the higher our power. This is a very useful and practical exercise for planning an experiment. Using some conservative assumptions, we can ballpark a good sample size. A size that’s not too small (i.e., under-powered) nor a size that’s not too big. If our sample size is too small, we’ll have a low probability of rejecting a false null hypothesis. If our sample size is too big, we spend unnecessary time and money and effort on our experiment.