October | 2011 | statistics you can probably trust

Maximum likelihood is one of those topics in mathematical statistics that takes a while to wrap your head around. At first glance it seems to be making something that seems easy (estimating a parameter with a statistic) into something way more complicated than it needs to be. For example, a frequent exercise is to find the maximum likelihood estimator of the mean of a normal distribution. You take the product of the n normal pdfs, take the log of that, find the first derivative, set it equal to 0 and solve. You find out that it’s $ \bar{x}$, i.e., the average. Yes, it’s good to know theoretically why we use $ \bar{x}$ to estimate the mean, but why would we use anything else? To me, way back when, it was akin to a long elaborate Rube Goldberg process to show that 2 + 2 equaled 4. I didn’t see the use of it. Of course if you stick with statistics long enough you find that maximum likelihood is indeed very useful, especially for proving results regarding efficiency and sufficiency.

Anyway, one result of maximum likelihood that baffled me for the longest time was the variance of a maximum likelihood estimator. It’s this:

$$ Var(\widehat{\theta}) \approx 1/-E(\frac{d^{2}logL}{d\theta^{2}})$$

To me this had no intuition at all. In fact it still doesn’t. However it works. Now many statistics books will go over determining the maximum likelihood estimator in painstaking detail, but then they’ll blow through the variance of the estimator in a few lines. The purpose of this post is to hopefully fill in those gaps.

It all starts with Taylor’s theorem, which says (in so many words):

$$ f(x) \approx f(a) + f'(a)(x-a) + \text{remainder}$$

We’re interested in approximating the variance, so we forget about the remainder. The rest is another (approximate) way to express $ f(x)$. In this case we have $ f(x) = \frac{dlogL}{d\theta}$ evaluated at $ \widehat{\theta}$ as our function, and $ x = \widehat{\theta}$ and $ a = \theta$. Plug in and we have:

$$ f(\widehat{\theta}) \approx \frac{dlogL}{d\theta} + \frac{d^{2}logL}{d\theta^{2}}(\widehat{\theta}-\theta)$$

The first thing to note here is that $ f(\widehat{\theta})=0$. Why? Because $ \widehat{\theta}$ is the solution to $ \frac{dlogL}{d\theta} = 0$. Plugging $ \widehat{\theta}$ into the function produces 0. Substituting that in and doing some algebra we get

$$ \widehat{\theta} \approx -\frac{dlogL}{d\theta} / \frac{d^{2}logL}{d\theta^{2}} + \theta$$

Now let’s take the variance of that expression:

$$ Var(\widehat{\theta}) \approx Var(-\frac{dlogL}{d\theta} / \frac{d^{2}logL}{d\theta^{2}} + \theta)$$

Wait, hold up. $ \frac{d^{2}logL}{d\theta^{2}}$ is the second derivative evaluated at $ \theta $. We don’t know $ \theta $. We had to approximate it with $ \widehat{\theta}$. So let’s substitute in the expected value, like so:

$$ Var(-\frac{dlogL}{d\theta} / E(\frac{d^{2}logL}{d\theta^{2}}) + \theta)$$

Now take the variance:

$$ Var(\widehat{\theta}) \approx 1/(E(\frac{d^{2}logL}{d\theta^{2}}))^{2} \times Var(\frac{dlogL}{d\theta} + \theta)$$

OK, getting closer. It’s starting to look like the result I showed at the beginning. We still need to find $ Var(\frac{dlogL}{d\theta} + \theta)$. Actually we just need to find $ Var(\frac{dlogL}{d\theta})$ since $ \theta $ is a constant and does not impact the variance calculation.

First let’s recall the useful formula $ Var(x) = E(x^{2}) – (E(x))^{2}$. Let’s use that here:

$$ Var(\frac{dlogL}{d\theta}) = E((\frac{dlogL}{d\theta})^{2}) – (E(\frac{dlogL}{d\theta}))^{2}$$

Now it turns out that the $ E(\frac{dlogL}{d\theta}) = 0$. A quick proof can be found on p. 202 of Bulmer’s Principles of Statistics. I would happily reproduce it here but I think it detracts from my goal. So believe me. $ E(\frac{dlogL}{d\theta}) = 0$. That changes our formula to

$$ Var(\frac{dlogL}{d\theta}) = E((\frac{dlogL}{d\theta})^{2}) $$

Once again, we have to back up and make yet another observation. Notice the following:

$$ \frac{d^{2}logL}{d\theta^{2}} = \frac{d}{d\theta}(\frac{1}{L}\frac{dL}{d\theta})=-\frac{1}{L^{2}}(\frac{dL}{d\theta})^{2} + \frac{1}{L}\frac{d^{2}L}{d\theta^{2}}$$

Recall that $ \frac{d log L}{d\theta}=\frac{1}{L}\frac{dL}{d\theta}$. Rearrange and we get $ \frac{dL}{d\theta}=\frac{d log L}{d\theta}L$. Substitute that in to our previous formula:

$$ \frac{d^{2}logL}{d\theta^{2}} =-\frac{1}{L^{2}}(\frac{d log L}{d\theta}L)^{2} + \frac{1}{L}\frac{d^{2}L}{d\theta^{2}}$$

$$ \frac{d^{2}logL}{d\theta^{2}} =-(\frac{d log L}{d\theta})^{2} + \frac{1}{L}\frac{d^{2}L}{d\theta^{2}}$$

Now rearrange as follows and see what we have:

$$ -\frac{d^{2}logL}{d\theta^{2}} + \frac{1}{L}\frac{d^{2}L}{d\theta^{2}} =(\frac{d log L}{d\theta})^{2} $$

Look at our variance formula we were working on:

$$ Var(\frac{dlogL}{d\theta}) = E((\frac{dlogL}{d\theta})^{2}) $$

See where we can make the substitution? Let’s do it:

$$ Var(\frac{dlogL}{d\theta}) = E(-\frac{d^{2}logL}{d\theta^{2}} + \frac{1}{L}\frac{d^{2}L}{d\theta^{2}}) $$

The expected value of the second term is 0 for the same reason that $ E(\frac{dlogL}{d\theta}) = 0$. Take my word for it. That leaves us with…

$$ Var(\frac{dlogL}{d\theta}) = E(-\frac{d^{2}logL}{d\theta^{2}}) = -E(\frac{d^{2}logL}{d\theta^{2}})$$

OK, we’re ALMOST THERE! Now bring back the full variance expression we had earlier…

$$ Var(\widehat{\theta}) \approx 1/(E(\frac{d^{2}logL}{d\theta^{2}}))^{2} \times Var(\frac{dlogL}{d\theta} + \theta)$$

…and plug what we just found:

$$ Var(\widehat{\theta}) \approx 1/(E(\frac{d^{2}logL}{d\theta^{2}}))^{2} \times -E(\frac{d^{2}logL}{d\theta^{2}})$$

Do the cancellation and we get the final reduced expression for the variance of the maximum likelihood estimator:

$$ Var(\widehat{\theta}) \approx 1/-E(\frac{d^{2}logL}{d\theta^{2}})$$

Power is the probability of rejecting a null hypothesis when the alternative is true. Say your null hypothesis is $ \mu \le 55$ and that your alternative hypothesis is $ \mu > 55$. Further, say the true state of the world is such that $ \mu = 60$. The probability we reject $ \mu \le 55$ given that state of the world, that is make the correct decision, is called Power.

Let’s say we carry out such a test but we don’t know the true state of the world (we never do in real life). We sample 20 items and decide to reject the null of $ \mu \le 55$ if $ \overline{x}\ge 60$. We’ll assume we’re dealing with a Normal distribution that has an unknown mean $ \mu$ and standard deviation of $ \sigma=5$. What is the power of such a test? Well, it depends on what $ \mu$ really is. For example, if $ \mu=61$, then the power of the test is

$$ 1 – \Phi\left(\frac{60-61}{5/\sqrt{20}}\right) = 1 – 0.186 = 0.814$$

This says we have about a 81% chance of getting a sample mean of 60 (or higher) if the true mean is 61. That’s pretty good power. Traditionally statisticians like to have at least 80% power when doing experiments. Of course the catch is you have to know the standard deviation. Is that even possible? Not really. What most people do is make the best estimate possible and err on the side of being too conservative. If they think the standard deviation is about 2.5, they’ll round it up to 3 to be safe. Now as your standard deviation increases, your power decreases. So being conservative means you have to increase your sample size to get the power back up to a desirable level. Going back to example, let’s say our standard deviation is 6 instead of 5:

$$ 1 – \Phi\left(\frac{60-61}{6/\sqrt{20}}\right) = 0.772$$

Notice how the power dropped to 77%. To increase it back to around 80% I can increase my sample size. To do so in this example I need to increase my sample size from 20 to 27:

$$ 1 – \Phi\left(\frac{60-61}{6/\sqrt{27}}\right) = 0.807$$

I could also hypothesize a different true mean in order to increase power. Previously I assumed a true mean of 60. If I leave my sample size at 20 and assume the larger standard deviation of 6, I can obtain 80% power by hypothesizing a mean of 61.15:

$$ 1 – \Phi\left(\frac{60-61.15}{6/\sqrt{20}}\right) = 0.804$$

These formulas I’ve been using are power functions and they cry out for a spreadsheet. In one column enter various hypothesized means and then run a power function down an adjacent column to see how the power changes. You can do the same with sample size.

Here’s one trying different means:

Here’s another trying different sample sizes:

So we see that the further away the true mean is from our cut-off point, or the bigger our sample, the higher our power. This is a very useful and practical exercise for planning an experiment. Using some conservative assumptions, we can ballpark a good sample size. A size that’s not too small (i.e., under-powered) nor a size that’s not too big. If our sample size is too small, we’ll have a low probability of rejecting a false null hypothesis. If our sample size is too big, we spend unnecessary time and money and effort on our experiment.

statistics you can probably trust

Help with learning statistics

Monthly Archives: October 2011

The variance of a maximum likelihood estimator

The power function