Monthly Archives: August 2011

The moment generating function of the gamma

The moment-generating function of the gamma distribution is M(t) = (1-\theta t)^{-\alpha}. Such a friendly little guy. Not what you would expect when you start with this:

M(t) = \int_{0}^{\infty}e^{tx}\frac{1}{\Gamma (\alpha)\theta^{\alpha}}x^{\alpha - 1}e^{-x/\theta}dx

How do we get there? First let’s combine the two exponential terms and move the gamma fraction out of the integral:

M(t) = \frac{1}{\Gamma (\alpha)\theta^{\alpha}}\int_{0}^{\infty}x^{\alpha - 1}e^{tx-x/\theta}dx

Multiply tx in the exponential by \frac{\theta}{\theta} = 1, add the two terms together and factor out -x:

M(t) = \frac{1}{\Gamma (\alpha)\theta^{\alpha}}\int_{0}^{\infty}x^{\alpha - 1}e^{-x(1-\theta t)/\theta}dx

Now we’re ready to do a substitution in the integral. Let u = \frac{x(1-\theta t)}{\theta} (assuming t < \frac{1}{\theta}, so that 1-\theta t > 0). That means we have du = \frac{1-\theta t}{\theta}dx. Now solve those for x and dx, respectively. We get x = \frac{\theta u}{1-\theta t} and dx = \frac{\theta}{1-\theta t}du. Now substitute those in the integral to see something that looks truly awful:

M(t) = \frac{1}{\Gamma (\alpha)\theta^{\alpha}}\int_{0}^{\infty}(\frac{\theta u}{1-\theta t})^{\alpha - 1}e^{-u}\frac{\theta}{1-\theta t}du

If we re-write everything inside the integral with exponents we can do some cancellation and make this integral more attractive:

M(t) = \frac{1}{\Gamma (\alpha)\theta^{\alpha}}\int_{0}^{\infty}(1-\theta t)^{-\alpha + 1}(1-\theta t)^{-1}(\theta u)^{\alpha-1}\theta e^{-u}du

Using the rules of exponents we see that (1-\theta t)^{-\alpha + 1}(1-\theta t)^{-1} = (1- \theta t)^{-\alpha} which can be moved out of the integral since it doesn’t have u in it:

M(t) = \frac{(1- \theta t)^{-\alpha}}{\Gamma (\alpha)\theta^{\alpha}}\int_{0}^{\infty}(\theta u)^{\alpha-1}\theta e^{-u}du

Using the rules of exponents again we see that \theta^{\alpha -1}\theta = \theta^{\alpha}:

M(t) = \frac{(1- \theta t)^{-\alpha}}{\Gamma (\alpha)\theta^{\alpha}}\int_{0}^{\infty}\theta^{\alpha}u^{\alpha - 1} e^{-u}du

Now notice we can cancel out the two \theta^{\alpha} terms:

M(t) = \frac{(1- \theta t)^{-\alpha}}{\Gamma (\alpha)}\int_{0}^{\infty}u^{\alpha - 1} e^{-u}du

The integral is now the gamma function: \Gamma (\alpha) = \int_{0}^{\infty}u^{\alpha - 1} e^{-u}du. Make that substitution:

M(t) = \frac{(1- \theta t)^{-\alpha}}{\Gamma (\alpha)}\Gamma(\alpha)

Cancel out the \Gamma(\alpha) terms and we have our nice-looking moment-generating function:

M(t) = (1- \theta t)^{-\alpha}
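
As a quick numerical sanity check (not part of the derivation; the parameter values are arbitrary), we can integrate the original MGF integrand in R and compare it to this closed form:

```r
# Numerically verify M(t) = (1 - theta*t)^(-alpha) for one arbitrary
# choice of alpha, theta, and t (we need t < 1/theta for convergence)
alpha <- 4; theta <- 2; t <- 0.1
mgf_integrand <- function(x) exp(t * x) * dgamma(x, shape = alpha, scale = theta)
integrate(mgf_integrand, 0, Inf)$value  # numeric integral
(1 - theta * t)^(-alpha)                # closed form: both are about 2.441
```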

If we take the derivative of this function and evaluate at 0 we get the mean of the gamma distribution:

M'(t) = -\alpha (1-\theta t)^{-\alpha -1}(-\theta)

M'(0) = -\alpha(1 - 0)^{-\alpha -1}(-\theta) = \alpha\theta = E(X)

Recall that \theta is the mean time between events and \alpha is the number of events. Multiply them together and you have the mean. This makes sense. If the mean wait time between events is 2 minutes (\theta = 2) and we’re interested in the distribution of times until 4 events occur (\alpha = 4), it seems reasonable to think the mean wait time to observe the 4th event would be 8 minutes (4(2)).

If we take the second derivative of the moment-generating function and evaluate at 0, we get the second moment about the origin which we can use to find the variance:

M''(t) = (-\alpha - 1)\theta\alpha(1-\theta t)^{-\alpha - 2}(-\theta)

M''(0) = (-\alpha - 1)\theta\alpha(1-0)^{-\alpha - 2}(-\theta) = (-\alpha - 1)(-\theta^{2})\alpha

Now find the variance:

Var(X) = (-\alpha - 1)(-\theta^{2})\alpha - (\alpha\theta)^{2}

Var(X) = \alpha^{2}\theta^{2} + \alpha\theta^{2} - \alpha^{2}\theta^{2}

Var(X) = \alpha\theta^{2}

Going back to our example with \alpha = 4 (number of events) and \theta=2 (mean time between events), we have as our variance 4(2^{2})=16. The square root of that gives our standard deviation as 4.
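
We can check both results by simulation. This is just a sketch, with an arbitrary seed and sample size:

```r
# Simulate draws from gamma(shape = 4, scale = 2) and compare the sample
# mean and variance to alpha*theta = 8 and alpha*theta^2 = 16
set.seed(1)
x <- rgamma(1e5, shape = 4, scale = 2)
mean(x)  # close to 8
var(x)   # close to 16
```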

Plugging the shape and scale parameters into the dgamma function in R, we can plot this particular gamma distribution:

xval <- seq(0, 30, by = 0.01)                # x values for plotting
ga.4 <- dgamma(xval, shape = 4, scale = 2)   # gamma density at each x
plot(xval, ga.4, col = 1, type = "l", ylab = "f(x)", xlab = "x")
title(main = "Gamma Distribution, scale = 2, shape = 4")

gamma with scale = 2 and shape = 4

Deriving the gamma distribution

Let’s take a look at the gamma distribution:

f(x) = \frac{1}{\Gamma(\alpha)\theta^{\alpha}}x^{\alpha-1}e^{-x/\theta}

I don’t know about you, but I think it looks pretty horrible. Typing it up in Latex is no fun and gave me second thoughts about writing this post, but I’ll plow ahead.

First, what does it mean? One interpretation of the gamma distribution is that it’s the theoretical distribution of waiting times until the \alpha-th change for a Poisson process. In another post I derived the exponential distribution, which is the distribution of times until the first change in a Poisson process. The gamma distribution models the waiting time until the 2nd, 3rd, 4th, 38th, etc, change in a Poisson process.

As we did with the exponential distribution, we derive it from the Poisson distribution. Let W be the random variable that represents waiting time. Its cumulative distribution function then would be

F(w) = P(W \le w) = 1 -P(W > w)

But notice that P(W > w) is the probability of fewer than \alpha changes in the interval [0, w]. In a Poisson process with mean \lambda w that probability is \sum_{k=0}^{\alpha-1}\frac{(\lambda w)^{k}e^{- \lambda w}}{k!}, so the cumulative distribution function becomes

F(w) = 1 - \sum_{k=0}^{\alpha-1}\frac{(\lambda w)^{k}e^{- \lambda w}}{k!}

To find the probability distribution function we take the derivative of F(w). But before we do that we can simplify matters a little by pulling the k = 0 term out of the summation:

= 1 - \frac{(\lambda w)^{0}e^{-\lambda w}}{0!}-\sum_{k=1}^{\alpha-1}\frac{(\lambda w)^{k}e^{- \lambda w}}{k!} = 1 - e^{-\lambda w}-\sum_{k=1}^{\alpha-1}\frac{(\lambda w)^{k}e^{- \lambda w}}{k!}

Why did I know to do that? Because my old statistics book did it that way. Moving on…

F'(w) = 0 - e^{-\lambda w}(-\lambda)-\sum_{k=1}^{\alpha-1}\frac{k![(\lambda w)^{k}e^{- \lambda w}(-\lambda)+e^{- \lambda w}k(\lambda w)^{k-1}\lambda]}{(k!)^{2}}

After lots of simplifying…

= \lambda e^{-\lambda w}+\lambda e^{-\lambda w}\sum_{k=1}^{\alpha-1}[\frac{(\lambda w)^{k}}{k!}-\frac{(\lambda w)^{k-1}}{(k-1)!}]

And we’re done! Technically we have the gamma probability distribution. But it’s a little too bloated for mathematical taste. And of course it doesn’t match the form of the gamma distribution I presented in the beginning, so we have some more simplifying to do. Let’s carry out the summation for a few terms and see what happens:

\sum_{k=1}^{\alpha-1}[\frac{(\lambda w)^{k}}{k!}-\frac{(\lambda w)^{k-1}}{(k-1)!}] = [\lambda w - 1] + [\frac{(\lambda w)^{2}}{2!}-\lambda w]+[\frac{(\lambda w)^{3}}{3!} - \frac{(\lambda w)^{2}}{2!}] +[\frac{(\lambda w)^{4}}{4!} - \frac{(\lambda w)^{3}}{3!}] + \ldots + [\frac{(\lambda w)^{\alpha - 2}}{(\alpha - 2)!} - \frac{(\lambda w)^{\alpha - 3}}{(\alpha - 3)!}] + [\frac{(\lambda w)^{\alpha - 1}}{(\alpha - 1)!} - \frac{(\lambda w)^{\alpha - 2}}{(\alpha - 2)!}]

Notice that everything cancels except the -1 in the first bracket and the \frac{(\lambda w)^{\alpha -1}}{(\alpha -1)!} term in the last bracket, so we’re left with

= -1 + \frac{(\lambda w)^{\alpha -1}}{(\alpha -1)!}

Plugging that back into the gamma pdf gives us

= \lambda e^{-\lambda w} + \lambda e^{-\lambda w}[-1 + \frac{(\lambda w)^{\alpha -1}}{(\alpha -1)!}]

This simplifies to

=\frac{\lambda (\lambda w)^{\alpha -1}}{(\alpha -1)!}e^{-\lambda w}

Now that’s a lean formula, but still not like the one I showed at the beginning. To get the “classic” formula we do two things:

  1. Let \lambda = \frac{1}{\theta}, just as we did with the exponential
  2. Use the fact that (\alpha -1)! = \Gamma(\alpha)

Doing that takes us to the end:

= \frac{\frac{1}{\theta}(\frac{w}{\theta})^{\alpha-1}}{\Gamma(\alpha)}e^{-w/\theta} = \frac{1}{\theta}(\frac{1}{\theta})^{\alpha -1}w^{\alpha - 1}\frac{1}{\Gamma(\alpha)}e^{-w/\theta} = (\frac{1}{\theta})^{\alpha}\frac{1}{\Gamma(\alpha)}w^{\alpha -1}e^{-w/\theta} = \frac{1}{\Gamma(\alpha)\theta^{\alpha}}w^{\alpha-1}e^{-w/\theta}
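
As a sanity check, we can evaluate the derived density at a few points and compare it to R's built-in dgamma (the parameter values below are arbitrary):

```r
# The derived density should match dgamma with the same shape and scale
alpha <- 3; theta <- 2
w <- seq(0.5, 10, by = 0.5)
f <- 1 / (gamma(alpha) * theta^alpha) * w^(alpha - 1) * exp(-w / theta)
all.equal(f, dgamma(w, shape = alpha, scale = theta))  # TRUE
```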

We call \alpha the shape parameter and \theta the scale parameter because of their effect on the shape and scale of the distribution. Holding \theta (scale) at a set value and trying different values of \alpha (shape) changes the shape of the distribution (at least when you go from \alpha = 1 to \alpha = 2):

gamma - hold scale, change shape

Holding \alpha (shape) at a set value and trying different values of \theta (scale) changes the scale of the distribution:

gamma - hold shape, change scale

In the applied setting \theta (scale) is the mean wait time between events and \alpha is the number of events. If we look at the first figure above, we’re holding the wait time at 1 and changing the number of events. We see that the probability of waiting 5 minutes or longer increases as the number of events increases. This is intuitive as it would seem more likely to wait 5 minutes to observe 4 events than to wait 5 minutes to observe 1 event, assuming a one minute wait time between each event. The second figure holds the number of events at 4 and changes the wait time between events. We see that the probability of waiting 10 minutes or longer increases as the time between events increases. Again this is pretty intuitive as you would expect a higher probability of waiting more than 10 minutes to observe 4 events when there is mean wait time of 4 minutes between events versus a mean wait time of 1 minute.

Finally notice that if you set \alpha = 1, the gamma distribution simplifies to the exponential distribution.
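
We can confirm this in R by comparing dgamma with shape = 1 to dexp (the scale value here is arbitrary):

```r
# With shape = 1, the gamma density reduces to the exponential density
x <- seq(0, 10, by = 0.1)
all.equal(dgamma(x, shape = 1, scale = 2), dexp(x, rate = 1/2))  # TRUE
```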

Update 5 Oct 2013: I want to point out that \alpha and \theta can take continuous values like 2.3, not just integers. So what I’ve really derived in this post is the relationship between the gamma and Poisson distributions.

The forgetful exponential distribution

The exponential distribution has the quirky property of having no memory. Before we wade into the math and see why, let’s consider a situation where there is memory: drawing cards. Let’s say you have a well-shuffled deck of 52 cards and you draw a single card. What’s the probability of drawing an ace? Since there are 4 aces in a deck of 52 cards, the probability is \frac{4}{52}. We draw our card and it’s not an ace. We set the card aside, away from the deck, and draw again. Now our probability of drawing an ace is \frac{4}{51}. We have a slightly better chance on the 2nd draw. The condition that we have already selected a card that wasn’t an ace changes the probability we draw an ace. This doesn’t happen with the exponential distribution.

Let’s say we have a state-of-the-art widget (version 2.0) that has a lifespan that can be described with an exponential distribution. Further, let’s say the mean lifespan is 60 months, or 5 years. Thanks to the “no memory” property, the probability of the widget lasting another 7 years is the same whether the widget is brand new or 5 years old. In math words:

P(X > 7 + 5 | X>5) = P(X>7)

That means if I bought a widget that was 5 years old, it has the same probability of lasting another 7 years as a brand new widget has for lasting 7 years. Not realistic but certainly interesting. Showing why this is the case is actually pretty straight-ahead.

We want to show that for the exponential distribution, P(X > x + y | X > x) = P(X > y).

Recall the cumulative distribution of an exponential distribution is P(X \le x)=F(x) = 1 -e^{-x/\theta}. That’s the probability of an event occurring before a certain time x. The complement of the cumulative distribution is the probability of an event occurring after a certain time:

P(X > x) = 1 - P(X \le x) = 1 - (1 - e^{-x/ \theta} ) = e^{-x/ \theta}

Also recall the definition of conditional probability: P(A |B) = \frac{P(A \cap B)}{P(B)}

Let’s plug into the equality we want to prove and see what happens:

P(X > x + y | X > x) = \frac{P(\{X>x+y\} \cap \{X>x\})}{P(X > x)} = \frac{P(X>x+y)}{P(X > x)} =\frac{e^{-(x+y)/\theta}}{e^{-x/\theta}} = \frac{e^{-x/\theta}e^{-y/\theta}}{e^{-x/\theta}} = e^{-y/\theta} = P(X>y)

There you go. Not too bad.
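
We can also verify the property numerically with pexp. The values of theta, x, and y below are arbitrary:

```r
# Check P(X > x + y | X > x) = P(X > y) for an exponential with mean 60
theta <- 60; x <- 60; y <- 84
lhs <- pexp(x + y, rate = 1/theta, lower.tail = FALSE) /
       pexp(x, rate = 1/theta, lower.tail = FALSE)
rhs <- pexp(y, rate = 1/theta, lower.tail = FALSE)
all.equal(lhs, rhs)  # TRUE
```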

We can actually go the other direction as well. That is, we can show that if P(X > x + y | X > x) = P(X > y) is true for a continuous random variable X, then X has an exponential distribution. Here’s how:

P(X > x + y | X > x) = P(X > y) (given)

\frac{P(\{X > x + y\} \cap \{X > x\})}{P(X > x)} = P(X > y) (using the definition of conditional probability)

\frac{P(X > x + y)}{P(X > x)} = P(X > y) (if X > x + y, then X > x)

\frac{1-F(x+y)}{1-F(x)}=1-F(y) (substitute the cdf expressions, P(X > x) = 1 - F(x))

Now substitute in generic function terminology, say h(x) = 1 - F(x):

\frac{h(x+y)}{h(x)}=h(y)
Rearranging terms gives us h(x+y)=h(y)h(x)

Now for that equality to hold, the function h(x) has to have an exponential form, where the variable is in the exponent, like this: a^{x}. Recall that a^{x}a^{y}=a^{x+y}. If h(x) = a^{x}, then our equality above works. So we let h(x)=a^{x}. That allows us to make the following conclusion:

1-F(x) = h(x) = a^{x} = e^{\ln a^{x}} = e^{x \ln a}

Now let b = \ln a. We get 1-F(x) = e^{bx}. Solving for F(x) we get F(x) = 1 - e^{bx}. Since F(\infty) = 1, b must be negative, so we can write b = -c with c > 0, giving F(x) = 1 - e^{-cx}. Now we just let c = \frac{1}{\theta} and we have the cumulative distribution function for an exponential distribution: F(x) = 1 - e^{-x/\theta}.

That’s the memoryless property for you. Or maybe it’s called the forgetfulness property. I can’t remember.

Deriving the exponential distribution

The exponential distribution looks harmless enough:

f(x) = \frac{1}{\theta}e^{-x/\theta}

It looks like someone just took the exponential function and multiplied it by \frac{1}{\theta}, and then for kicks decided to do the same thing in the exponent except with a negative sign. If we integrate this for all x>0 we get 1, demonstrating it’s a probability distribution function. So is this just a curiosity someone dreamed up in an ivory tower? No it actually turns out to be related to the Poisson distribution.

Recall the Poisson describes the distribution of probability associated with a Poisson process. That is, the number of events occurring over time or on some object in non-overlapping intervals are independent. For example, maybe the number of 911 phone calls for a particular city arrive at a rate of 3 per hour. The interval of 7 pm to 8 pm is independent of 8 pm to 9 pm. The expected number of calls for each hour is 3. The Poisson distribution allows us to find, say, the probability the city’s 911 number receives more than 5 calls in the next hour, or the probability they receive no calls in the next 2 hours. It deals with discrete counts.

Now what if we turn it around and ask instead how long until the next call comes in? Now we’re dealing with time, which is continuous as opposed to discrete. We’re limited only by the precision of our watch. Let’s be more specific and investigate the time until the first change in a Poisson process. Before diving into math, we can develop some intuition for the answer. If events in a process occur at a rate of 3 per hour, we would probably expect to wait about 20 minutes for the first event. Three per hour implies once every 20 minutes. But what is the probability the first event occurs within 20 minutes? What about within 5 minutes? How about after 30 minutes? (Notice I’m saying within and after instead of at. When finding probabilities of continuous events we deal with intervals instead of specific points. The probability of an event occurring at a specific point in a continuous distribution is always 0.)

Let’s create a random variable called W, which stands for wait time until the first event. The probability the wait time is less than or equal to some particular time w is P(W \le w). Let’s say w=5 minutes, so we have P(W \le 5). We can take the complement of this probability and subtract it from 1 to get an equivalent expression:

P(W \le 5) = 1 - P(W>5)

Now P(W>5) implies no events occurred before 5 minutes. That is, nothing happened in the interval [0, 5]. What is the probability that nothing happened in that interval? Well now we’re dealing with events again instead of time. And for that we can use the Poisson:

Probability of no events in interval [0, 5] = P(X=0) = \frac{(5\lambda)^{0}e^{-5\lambda}}{0!}=e^{-5\lambda}

So we have P(W \le 5) = 1 - P(W>5) = 1 - e^{-5\lambda}

That’s the cumulative distribution function evaluated at 5 minutes. For a general time w it is F(w) = 1 - e^{-\lambda w}. If we take the derivative of the cumulative distribution function, we get the probability distribution function:

F'(w)=f(w)=\lambda e^{-\lambda w}

And there we have the exponential distribution! Usually we let \lambda = \frac{1}{\theta}. And that gives us what I showed in the beginning:

f(x) = \frac{1}{\theta}e^{-x/\theta}

Why do we do that? That allows us to have a parameter in the distribution that represents the mean waiting time until the first change. Recall my previous example: if events in a process occur at a mean rate of 3 per hour, or 3 per 60 minutes, we expect to wait 20 minutes for the first event to occur. In symbols, if \lambda is the mean number of events, then \theta=\frac{1}{\lambda}, the mean waiting time for the first event. So if \lambda=3 is the mean number of events per hour, then the mean waiting time for the first event is \theta=\frac{1}{3} of an hour. Notice that \lambda=3=\frac{1}{1/3}=\frac{1}{\theta}.

Tying everything together, if we have a Poisson process where events occur at, say, 12 per hour (or 1 every 5 minutes) then the probability that exactly 1 event occurs during the next 5 minutes is found using the Poisson distribution with mean 5 \times \frac{1}{5} = 1:

P(X=1) = \frac{1^{1}e^{-1}}{1!}=0.368

But the probability that we wait less than some time for the first event, say 5 minutes, is found using the exponential distribution (with \theta = \frac{1}{1/5} = 5):

P(W < 5) = 1 - e^{-5/5} = 1 - e^{-1} = 0.632
Now it may seem we have a contradiction here. We have a 63% chance of witnessing the first event within 5 minutes, but only a 37% chance of witnessing one event in the next 5 minutes. While the two statements seem identical, they’re actually assessing two very different things. The Poisson probability is the chance we observe exactly one event in the next 5 minutes. Not 2 events, not 0, not 3, etc. Just 1. That’s a fairly restrictive question. We’re talking about one outcome out of many. The exponential probability, on the other hand, is the chance we wait less than 5 minutes to see the first event. This is inclusive of all times before 5 minutes, such as 2 minutes, 3 minutes, 4 minutes and 15 seconds, etc. There are many times considered in this calculation. So we’re likely to witness the first event within 5 minutes with a better than even chance, but there’s only a 37% chance that all we witness in that 5-minute span is exactly one event. The latter probability of 37% is similar to the idea that you’re likely to get 5 heads if you toss a fair coin 10 times. That is indeed the most likely outcome, but that outcome only has about a 25% chance of happening. Not impossible, but not exactly what I would call probable. Again it has to do with considering only 1 outcome out of many. The Poisson probability in our question above considered one outcome while the exponential probability considered the infinity of outcomes between 0 and 5 minutes.
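
Both probabilities can be computed directly in R, using dpois with a mean of 1 event per 5-minute interval and pexp with \theta = 5 (i.e., rate 1/5):

```r
# Probability of exactly one event in the next 5 minutes (Poisson, mean 1)
dpois(1, lambda = 1)   # about 0.368
# Probability the first event occurs within 5 minutes (exponential, theta = 5)
pexp(5, rate = 1/5)    # about 0.632
```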

Pointwise confidence interval estimates of survivor functions

The most commonly used descriptive method for survival data is the Kaplan-Meier estimator (also known as the product-limit estimator). It estimates the probability of “surviving”, or working, past a certain point. For example, if the Kaplan-Meier estimate at 32 weeks is 0.54, that means the estimated probability of surviving 32 weeks or longer is 0.54.  There are many web sites and books that will tell you all about the Kaplan-Meier estimator.

Now once you have an estimate, it’s a good idea to form a confidence interval to give you some idea of how precise your estimate is. The common approach in statistics is to take the standard error of the estimate, multiply it by 1.96, and add and subtract the product to the estimate to form a 95% confidence interval. The problem with that approach here is that our estimate is a probability that ranges in value from 0 to 1. It’s very possible that the margin of error added or subtracted to the estimate will result in an interval that falls below 0 or exceeds 1. One solution is the log-log transformation.

The log-log transformation does just what its name says, including the dash, which represents a minus sign. You take the log of the estimate, make it negative, and then take the log again. The result of this transformation is a value that ranges from -\infty to +\infty, just like the normal distribution (which is implied in our use of 1.96 as the number of standard errors for our margin of error calculation). In addition this assures us that when we transform back to our original scale, we have limits that range from 0 to 1.

Here’s an example: say we have a Kaplan-Meier estimate of 0.88 at 13 days. That means we estimate an 88% chance of surviving beyond 13 days. Furthermore let’s say our estimate has a standard error of 0.0650. (These values are often default output in statistical programs that perform survival analysis.) Without a transformation we get the following confidence interval:

0.88 \pm 1.96(0.0650) = (0.753,1.007)

We exceed one in the upper limit, so this won’t do. Let’s do the transformation. First we transform the Kaplan-Meier estimate:

\log(-\log(0.88)) = \log(0.1278) = -2.057
Next we transform the standard error estimate. (Just trust me on this one):

\frac{0.0650}{0.88 \times |\log(0.88)|} = \frac{0.0650}{0.1125} = 0.5777
Now we can calculate our confidence interval:

-2.057 \pm 1.96(0.5777) = (-3.1894, -0.9247)

That’s great, but we have to transform it back to the original scale for it to have any meaning. To go back we take the exponential once, then twice, then take the inverse:

  1. exp(log(-log(0.88))) = -log(0.88) = log((0.88)^{-1}) = log \frac{1}{0.88}
  2. exp(log \frac{1}{0.88}) = \frac{1}{0.88}
  3. (\frac{1}{0.88})^{-1} = 0.88

If we do that with our confidence limits we obtain:

(exp(exp(-0.9247)))^{-1} = 0.6726

(exp(exp(-3.1894)))^{-1} = 0.9596

Notice that the upper limit before transforming back (-0.9247) becomes the lower limit after transformation (0.6726), and vice versa. Also notice the confidence interval is not symmetric about our estimate of 0.88.
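
The whole hand calculation can be sketched in a few lines of R. (Note that exp(-exp(x)) is the same back-transformation as taking exponentials twice and then inverting.)

```r
# Log-log 95% confidence interval for a survival estimate of 0.88
# with standard error 0.0650
S <- 0.88; se <- 0.0650
ll   <- log(-log(S))                  # transformed estimate: -2.057
sell <- se / (S * abs(log(S)))        # transformed standard error: 0.5777
lims <- ll + c(1.96, -1.96) * sell    # order flips on back-transformation
exp(-exp(lims))                       # about (0.673, 0.960)
```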

Fortunately we don’t have to do this sort of thing by hand. Most statistical programs, if not all, give the log-log confidence intervals by default. There are other transformations you can use such as the logit and arcsine, but there is little practical difference between them (say Hosmer, Lemeshow, and May in Applied Survival Analysis, p. 32).

Having said all that, let me emphasize we calculated the 95% confidence interval for one time point at 13 days. Survival data sets will have many observed events, thus we will have many 95% confidence intervals. Now the probability that a collection of 95% confidence intervals all contain the true parameter they’re trying to estimate is much less than 95%. For example, the probability a collection of just ten 95% confidence intervals contain the true parameter is about 60%. So why do this at all? Mainly because we want to create confidence intervals for a few selected points, usually the 25th percentile, the median and the 75th percentile. We use the log-log pointwise confidence intervals to create a confidence interval for the length of time. If our events were measured in days, then our median confidence interval might be (32,65) days.

The way we construct these intervals is easy enough. Say we want the 25th percentile. In the lower-limit column of your confidence intervals, find the smallest event time such that the probability of “dying” earlier is greater than 0.25. Do the same in the upper-limit column. The corresponding event times become the lower and upper limits for your 25th percentile confidence interval. Again most stats programs do this for you, but it’s good to know what the program is doing behind the scenes.

Stabilization simulation of relative frequency

One of my favorite simulations is seeing how the relative frequency of an event stabilizes around a theoretical probability as the number of trials increases. For example, we say the probability of rolling a 5 when rolling a single die is \frac{1}{6}. Although it’s an obvious probability, it’s still fun to simulate 10,000 rolls of a die and see that indeed the relative frequency of the occurrence of 5’s stabilizes around \frac{1}{6}. When this concept is presented in a textbook it is often accompanied by a squiggly line graph that settles down around a straight line that represents the theoretical probability. That’s what I want to show how to do in this post, except using a dice example that does not have an obvious probability.

Let’s say we roll a single fair Las Vegas die six times and define success as follows:

  • if the number on the die matches the number of the roll at least once, we have success. (So if it’s our second roll and we roll a 2, we have success.)
  • If we go all six rolls with no result matching the order of the roll, we have failure.

While each roll of the die in 6 tries has a probability of \frac{1}{6} of matching the attempt number, there is no intuition for determining the probability of success as we defined it above. Theoretically we can calculate it easily enough. We note the probability of NOT rolling the same die value as attempt number is \frac{5}{6}. Furthermore, the probability of NOT rolling the same die value as attempt number for all six tries is (\frac{5}{6})^{6}. Now if we subtract (\frac{5}{6})^{6} from 1, we have the probability of rolling the same die value as attempt number at least once. So our probability is 1-(\frac{5}{6})^{6}=0.665102. That’s pretty high. Even though success on a single roll has probability of \frac{1}{6}, the probability we get at least one success in 6 rolls is about \frac{2}{3}. The idea that we have a better than even chance of seeing a success in 6 rolls may be intuitive, but the theoretical probability of 0.665102 is certainly not intuitive.
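
The arithmetic is a one-liner in R:

```r
# Probability of at least one match in six rolls
1 - (5/6)^6  # 0.665102
```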

We can simulate this experiment many times using a computer and check whether or not the relative frequency of success converges to the theoretical probability we calculated. We can then make one of those cool line graphs and see if the relative frequency settles down around 0.665. Here’s how I did it in R:


p <- rep(NA,1000)  # running relative frequency of success
s <- rep(NA,1000)  # 1 if trial j was a success, 0 otherwise
    for (j in 1:1000){
        ex <- rbinom(1,6,(1/6)) #run the experiment
        s[j] <- ifelse(ex>=1,1,0) #success or failure?
        p[j] <- sum(s,na.rm = TRUE)/j #relative frequency through trial j
    }


The first two lines create empty vectors to store the relative frequency (p) and whether or not we saw success in each trial (s). I then execute a for loop with 1000 iterations. Each time through the loop I run the experiment using the rbinom function, I check to see if I have success using the ifelse function (storing 1 if "yes" and 0 otherwise), and then calculate the relative frequency (p). Notice that the relative frequency uses information from the previous trials in each iteration. For example, I could see a sequence such as...

  1. roll six times, get a success: p = 1/1
  2. roll six times, get a success: p= 2/2
  3. roll six times, no success: p = 2/3
  4. roll six times, get a success: p = 3/4
  5. roll six times, get a success: p = 4/5

Each time the denominator increases and matches the trial number. The numerator is always the previous numerator plus 1 if there was a success, or plus 0 if no success.

Once you run the simulation, you can graph it with the following:


n <- 1:1000
plot(n,p,type="l",main="convergence to probability as n increases")
lines(c(1,1000),c(0.665, 0.665))


The first line creates my x-axis values. The second line makes the plot. The third line adds a horizontal line in the graph to represent the theoretical probability. Here's how my graph turned out:


Simulating outcomes of a simple experiment

One of the first things you learn in an introductory mathematical statistics class is how to assign a probability model to the outcomes of a simple experiment. For example, say we have an urn containing the following:

  • 2 blue marbles
  • 3 green marbles
  • 3 red marbles
  • 2 yellow marbles

The experiment is to select one marble at random and note the color. Our intuition tells us to assign probabilities as follows:

  • P(blue) = \frac{2}{10}
  • P(green) = \frac{3}{10}
  • P(red) = \frac{3}{10}
  • P(yellow) = \frac{2}{10}

If we code our colors as 1=blue, 2=green, 3=red and 4=yellow, then our probability model is given by the probability mass function (p.m.f):

f(1)=\frac{2}{10}, f(2)=\frac{3}{10}, f(3)=\frac{3}{10}, f(4)=\frac{2}{10}

We can construct a probability histogram of this p.m.f. using R as follows:


ht <- rep(c(1,2,3,4),c(2,3,3,2))  # one code per marble: 1=blue, 2=green, 3=red, 4=yellow
hist(ht, freq = FALSE)            # probability histogram of the p.m.f.


This gives us the nice histogram:

If we were to actually carry out this experiment a number of times and make a histogram, we would expect to see a similar shape. This can be easily simulated on a computer. Unfortunately many textbooks don't actually tell you how to do this. Probably because doing so would mean documenting some method using some program that will likely change or be obsolete at some point. But here I show you how it can be done in R. And I can't imagine this method will go out of date any time soon.

The easy part is the simulation. One call to the rmultinom function does the trick:


rmultinom(1, 500, c(2,3,3,2))


The 1 means I want one sample, that is, one column of results. The 500 is the number of trials, so I perform the experiment 500 times. The c(2,3,3,2) is a vector representing the number of colored marbles I have in each category; rmultinom normalizes it to probabilities internally, so I also could have used the probabilities: c(0.2,0.3,0.3,0.2).

When I call this function I need to remember the results, so I give it a name, "h":


h <- rmultinom(1, 500, c(2,3,3,2))


"h" stores a matrix of numbers:


[1,] 100
[2,] 159
[3,] 141
[4,] 100


These are the results of the 500 experiments. I picked 100 blues, 159 greens, 141 reds and 100 yellows. Now to make the histogram. What I choose to do is create a vector of 500 items using the codes I assigned above (1=blue, 2=green, 3=red, and 4=yellow). I can do this using the rep() function as follows:


hx <- rep(c(1,2,3,4),c(h[1,1],h[2,1],h[3,1],h[4,1]))


The first part - c(1,2,3,4) - says I want to repeat the numbers 1, 2, 3, and 4. The second part - c(h[1,1],h[2,1],h[3,1],h[4,1]) - indicates how many times to repeat each number. h[1,1] references row 1, column 1 of the matrix called "h"; h[2,1] references row 2, column 1 of the matrix; and so forth. Now I can use the hist function to create our histogram:

hist(hx, freq = FALSE)

Here's the result:

As expected, it looks pretty close to the theoretical histogram we created first. Again, nothing mind-blowing in this post. But I did want to document one way to simulate these simple experiments we so often see early in our statistical studies.

Investigating the chi-square test of independence

Survey a random sample of 500 men and ask if they have seen Star Wars more than twice. Survey 500 women and ask the same question. We might guess that the proportion of people in each gender who say “yes” will differ significantly. In other words, the likelihood of answering the question one way or another depends on the gender. What if we surveyed a random sample of 500 men and 500 women and asked whether or not their birthday was in December? We probably wouldn’t expect gender to influence the likelihood of having a birthday in December.

In my first example, the two variables are…

  1. Have you seen Star Wars more than twice? (Yes/No)
  2. Gender (Male/Female)

In the second example the two variables are…

  1. Is your birthday in December? (Yes/No)
  2. Gender (Male/Female)

Our question is whether or not these variables are associated. The quick way to answer this question is the chi-square test of independence. There are many good web pages that explain how to do this test and interpret the results, so I won’t re-invent the iPhone. Instead, what I want to demonstrate is why it’s called the “chi-square” test of independence.

Let’s look at the statistic for this test:

\chi^{2} = \sum_{i=1}^{k}\frac{(O_{i} - E_{i})^{2}}{E_{i}}

k = the number of combinations possible between the two variables. For example,

  1. Male & seen Star Wars more than twice
  2. Male & have not seen Star Wars more than twice
  3. Female & seen Star Wars more than twice
  4. Female & have not seen Star Wars more than twice

O = the OBSERVED number of occurrences for each combination. Perhaps in my survey I found 12 females out of 500 who have seen Star Wars more than twice and 488 who have not. By contrast, I found 451 males who have seen it more than twice and 49 who have not.

E = the EXPECTED number of occurrences IF the variables were NOT associated.

The chi-square test of independence is a hypothesis test. Therefore we make a falsifiable assumption about the two variables and calculate the probability of witnessing our data given our assumption. In this case, the falsifiable assumption is that the two variables are independent. If the two variables really are independent of one another, how likely are we to see the data we just saw? That’s a rough interpretation of the p-value.
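Using the made-up survey counts above (451 of 500 men and 12 of 500 women answering "yes"), the expected counts and the test look like this in R. This is a sketch; chisq.test does all the work:

```r
# Observed counts: rows = gender, columns = answer to the Star Wars question
obs <- matrix(c(451, 12, 49, 488), nrow = 2,
              dimnames = list(c("male", "female"), c("yes", "no")))

# Expected counts under independence: (row total x column total) / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# chisq.test computes the same expected counts, the statistic, and the p-value
test <- chisq.test(obs, correct = FALSE)
test$expected   # matches 'expected' above
test$statistic  # sum((obs - expected)^2 / expected)
test$p.value
```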

Now let’s use R to generate a bunch of data on two independent categorical variables (each with two levels), run 1000 chi-square tests on these variables, and plot a histogram of the chi-square statistic:

p <- 0.4
n1 <- 500
n2 <- 500
n <- n1 + n2
x <- rep(NA,1000)
for (i in 1:1000){
	n11 <- sum(rbinom(n1,1,p))
	n21 <- n1 - n11
	n12 <- sum(rbinom(n2,1,p))
	n22 <- n2 - n12

	table <- matrix(c(n11,n21,n12,n22),nrow=2)
	test <- chisq.test(table,correct=F)

	x[i] <- test$statistic
}
hist(x, freq=F)


You can see I generated two binomial samples of 500 values each (with n=1 and p = 0.4) on every pass through the loop. You can think of each sample as flipping a coin 500 times (or asking 500 girls if they have seen Star Wars more than twice). I then took the sum of the "successes" (or girls who said yes). The success probability is the same for both samples, so they really are independent of one another. After that I calculated the chi-square test statistic using R's chisq.test function and stored it in a vector that holds 1000 values. Finally R made us a nice histogram. As you can see, when you run the test 1000 times, you most likely will get a test statistic value below 4.
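That "below 4" observation is no accident. Under independence the statistic is approximately chi-square with one degree of freedom, and the 95th percentile of that distribution sits just below 4:

```r
# 95th percentile of a chi-square distribution with 1 degree of freedom;
# simulated statistics should land below this value about 95% of the time
cutoff <- qchisq(0.95, df = 1)
cutoff  # roughly 3.84
```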

Now let's plot a chi-square distribution with one degree of freedom on top of the histogram and show that the chi-square test statistic does indeed follow a chi-square distribution approximately. Note that I put approximately in italics. I really wanted you to see that. Oh, and why one degree of freedom? Because given the row and column totals in your 2 x 2 table where you store your results, you only need one cell to figure out the other three. For example, if I told you I had 1000 data points in my Star Wars study (500 men and 500 women), and that I had 350 "yes" responses to my question and 650 "no" responses, I would only need to tell you, say, that 21 women answered "yes" in order for you to figure out the other three values (men-yes, men-no, women-no). In other words, only one out of my four table cells is free to vary. Hence the one degree of freedom. Now about that histogram.

h <- hist(x, plot=F)
ylim <- range(0, h$density, 0.75)
hist(x, freq=F, ylim=ylim)
curve(dchisq(x,df=1), add = TRUE)

When I said it was approximate, I wasn't kidding, was I? Below 1 the fit isn't very good. Above 2 it gets much better. Overall, though, it's not bad. It's approximate, and that's often good enough. Hopefully this helps you see visually where this test statistic gets its name. Often this test is taught in a class or presented in a textbook with no visual aids whatsoever. The emphasis is on the tables and the math. But here I wanted to crawl under the hood and show you how the chi-square distribution comes into play.
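The one-free-cell argument can also be checked directly with the observed counts from the Star Wars table earlier (500 men, 500 women, 463 total "yes" answers): once a single cell is fixed, the margins force the rest. A small sketch:

```r
# Margins from the observed table: 500 men, 500 women,
# 463 total "yes" answers (451 men + 12 women) and 537 "no"
n_men <- 500; n_women <- 500
n_yes <- 463; n_no <- 537

# Fixing the single cell women-yes forces the other three cells
women_yes <- 12
men_yes   <- n_yes - women_yes    # 451
women_no  <- n_women - women_yes  # 488
men_no    <- n_men - men_yes      # 49
```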

Explaining and simulating a chi-square distribution

Randomly select several values, say 10, from a standard Normal distribution. That’s a Normal distribution with a mean of 0 and standard deviation of 1. Now square each value and sum them all up. That sum is an observation from a chi-square distribution with 10 degrees of freedom. If you picked 12 values and did the same, you would have an observation from a chi-square distribution with 12 degrees of freedom.

We can do this in R as follows:

# random sample of 10 values from N(0,1), squared and summed up
s <- rnorm(10)
sum(s^2)

When I did it, I got the following:

s
[1]  2.3235602  0.1339985  0.5645995  1.3603184 -1.1383273  0.1735691
[7]  0.9075259  0.2842374 -0.4214184 -0.1069466
s^2
[1] 5.39893181 0.01795559 0.31877258 1.85046616 1.29578902 0.03012623
[7] 0.82360328 0.08079092 0.17759345 0.01143758
sum(s^2)
[1] 10.00547

Now let's repeat this 1000 times and make a histogram:

x <- rep(NA, 1000)
for (j in 1:1000){
	x[j] <- sum(rnorm(10)^2)
}
hist(x)

Obviously we don't see the Normal-looking bell shape. It skews right. Also notice that since we squared all our values, every sum is positive, which means the histogram starts at 0. Now let's plot a chi-square curve with 10 degrees of freedom on top of our histogram:

h <- hist(x, plot=F)
ylim  <- range(0, h$density, 0.10)
hist(x, freq=F, ylim=ylim)
curve(dchisq(x,df=10), add = TRUE)

I had to experiment a little to get the 0.10 value in the range statement. That was necessary to keep R from setting the displayed range of the y-axis too small.
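The trial-and-error can be avoided by computing the curve's peak and letting that drive the y-axis limits. A sketch of that alternative, repeating the simulation above:

```r
# Simulate 1000 sums of 10 squared standard Normal values
x <- replicate(1000, sum(rnorm(10)^2))

# Size the y-axis from the histogram bars AND the density curve itself,
# rather than a hand-picked value like 0.10
h <- hist(x, plot = FALSE)
curve_peak <- max(dchisq(seq(min(x), max(x), length.out = 200), df = 10))
ylim <- range(0, h$density, curve_peak)

hist(x, freq = FALSE, ylim = ylim)
curve(dchisq(x, df = 10), add = TRUE)
```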

The formula for the chi-square curve (i.e., the probability density function) is

f(x) = \frac{1}{\Gamma(r/2)2^{r/2}}x^{r/2 - 1}e^{-x/2}

where r is degrees of freedom and x must be positive. And, yes, I think it looks pretty frightening if you're not used to that sort of thing.

Why the moment-generating function works

In a previous post, I described how a moment-generating function does what its name says it does. Still, though, it’s not clear why it works. I mean, look at the formula:

M(t) = \sum_{x}e^{tx}f(x)

That’s the discrete case. In the continuous case we have:

M(t) = \int_{-\infty}^{\infty}e^{tx}f(x)dx

So you take the usual expected value formula for either discrete or continuous cases, stick e^{tx} in there, and you get a function that generates moments. I don’t know about you, but that doesn’t make me pop up out of my chair and exclaim, “oh, of course, I see it! How obvious.” Fortunately, it’s not beyond comprehension. To see why it does what it does, let’s take the derivative of the moment-generating function in the discrete case, evaluate it at 0, and see what happens:

\frac{d}{dt}M(t)=\frac{d}{dt}\sum_{x}e^{tx}f(x) =\sum_{x}\frac{d}{dt}e^{tx}f(x) =\sum_{x}e^{tx}xf(x) =EXe^{tX}

Now evaluate the derivative at 0:

\frac{d}{dt}M(0)=EXe^{0X} = EX

So evaluating the first derivative of the moment-generating function at 0 gives us EX, the first moment about the origin, also known as the mean.

Let’s evaluate the second derivative of the moment-generating function at 0 and see what happens:

\frac{d^{2}}{dt^{2}}M(t)=\frac{d}{dt}\sum_{x}xe^{tx}f(x) =\sum_{x}x\frac{d}{dt}e^{tx}f(x) =\sum_{x}x^{2}e^{tx}f(x) =EX^{2}e^{tX}

Now evaluate the derivative at 0:

\frac{d^{2}}{dt^{2}}M(0)=EX^{2}e^{0X} = EX^{2}

That gives us the second moment about the origin, which comes in handy when finding variance, since Var(X) = EX^{2}-(EX)^{2}.
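A numerical sketch can make this concrete. Here I approximate the MGF of a sample as M(t) ≈ mean(e^{tX}) and difference it at 0; the Poisson(3) choice and the sample size are my own illustration, not from the original derivation. The finite-difference estimates should sit very close to the sample moments mean(X) and mean(X^2):

```r
set.seed(42)
lambda <- 3
X <- rpois(100000, lambda)

# Empirical moment-generating function: estimate E[e^{tX}] by a sample mean
M <- function(t) mean(exp(t * X))

# Central-difference approximations to M'(0) and M''(0)
eps <- 1e-4
m1 <- (M(eps) - M(-eps)) / (2 * eps)          # approximates EX
m2 <- (M(eps) - 2 * M(0) + M(-eps)) / eps^2   # approximates EX^2

# The variance then follows from Var(X) = EX^2 - (EX)^2
m2 - m1^2
```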

This will keep happening for the third, fourth, fifth and every derivative thereafter, provided the moment-generating function exists. And lest you think that’s just some technicality you never have to worry about, you should know the venerable t and F distributions do not have moment-generating functions. It turns out some of their moments are infinite. So there you go.