Category Archives: Expectation

A Probability Problem in Heredity – Part 2

In my last post I solved a problem from chapter 2 of M.G. Bulmer’s Principles of Statistics. In this post I work through a problem in chapter 11 that is basically a continuation of the chapter 2 problem. If you take a look at the previous post, you’ll notice we were asked to find probability in terms of theta. I did it and that’s nice and all, but we can go further. We actually have data, so we can estimate theta. And that’s what the problem in chapter 11 asks us to do. If you’re wondering why it took 9 chapters to get from finding theta to estimating theta, that’s because the first problem involved basic probability and this one requires maximum likelihood. It’s a bit of a jump where statistical background is concerned.

The results of the last post were as follows:

purple-flowered red-flowered
long pollen \( \frac{1}{4}(\theta + 2)\) \( \frac{1}{4}(1 – \theta) \)
round pollen \( \frac{1}{4}(1 – \theta) \) \( \frac{1}{4}\theta \)

The table provides probabilities of four possible phenotypes when hybrid sweet peas are allowed to self-fertilize. For example, the probability of a self-fertilizing sweet pea producing a purple-flower with long pollen is \( \frac{1}{4}(1 – \theta)\). In this post we’ll estimate theta from our data. Recall that \( \theta = (1 – \pi)^{2} \), where \( \pi \) is the probability of the dominant and recessive genes of a characteristic switching chromosomes.

Here’s the data:

Purple-flowered Red-flowered
Long pollen 1528 117
Round pollen 106 381

We see from the table there are 4 exclusive possibilities when the sweet pea self-fertilizes. If we think of each possibility having its own probability of occurrence, then we can think of this data as a sample from a multinomial distribution. Since chapter 11 covers maximum likelihood estimation, the problem therefore asks us to use the multinomial likelihood function to estimate theta.

Now the maximum likelihood estimator for each probability is \( \hat{p_{i}} = \frac{x_{i}}{n} \). But we can’t use that. That’s estimating four parameters. We need to estimate one parameter, theta. So we need to go back to the multinomial maximum likelihood function and define \( p_{i}\) in terms of theta. And of course we’ll work with the log likelihood since it’s easier to work with sums than products.

\( \log L(\theta) = y_{1} \log p_{1} + y_{2} \log p_{2} + y_{3} \log p_{3} + y_{4} \log p_{4} \)

If you’re not sure how I got that, google “log likelihood multinomial distribution” for more PDF lecture notes than you can ever hope to read.

Now let’s define the probabilities in terms of one parameter, theta:

\( \log L(\theta) = y_{1} \log f_{1}(\theta) + y_{2} \log f_{2}(\theta) + y_{3} \log f_{3}(\theta) + y_{4} \log f_{4}(\theta) \)

Now take the derivative. Once we have that we can set equal to 0 and find a solution for theta. The solution will be the point at which theta obtains its maximum value:

\( \frac{d \log L(\theta)}{d\theta} = \frac{y_{1}}{f_{1}(\theta)} f’_{1}(\theta) + \frac{y_{1}}{f_{2}(\theta)} f’_{2}(\theta) + \frac{y_{1}}{f_{3}(\theta)} f’_{3}(\theta) + \frac{y_{1}}{f_{4}(\theta)} f’_{4}(\theta)\)

Time to go from the abstract to the applied with our values. The y’s are the data from our table and the functions of theta are the results from the previous problem.

\( \frac{d \log L(\theta)}{d\theta} = \frac{1528}{1/4(2 + \theta)} \frac{1}{4} – \frac{117}{1/4(1 – \theta)}\frac{1}{4} – \frac{106}{1/4(1 – \theta)}\frac{1}{4} + \frac{381}{1/4(\theta)} \frac{1}{4} \)
\( \frac{d \log L(\theta)}{d\theta} = \frac{1528}{2 + \theta} – \frac{117}{1 – \theta} – \frac{106}{1 – \theta} + \frac{381}{\theta} \)
\( \frac{d \log L(\theta)}{d\theta} = \frac{1528}{2 + \theta} – \frac{223}{1 – \theta} + \frac{381}{\theta} \)

Set equal to 0 and solve for theta. Beware, it gets messy.

\( \frac{[1528(1 – \theta)\theta] – [223(2 + \theta)\theta] + [381(2 + \theta)(1 – \theta)]}{(2 + \theta)(1- \theta)\theta} = 0\)

Yeesh. Fortunately the denominator cancels out when we start multiplying terms and solving for theta. So we’re left with this:

\( 1528\theta – 1528\theta^{2} – 446\theta – 223\theta^{2} + 762 – 381\theta – 381\theta^{2} = 0\)

And that reduces to the following quadratic equation:

\( -2132\theta^{2} + 701\theta + 762 = 0\)

I propose we use an online calculator to solve this equation. Here’s a good one. Just plug in the coefficients and hit solve to find the roots. Our coefficients are a = -2132, b = 701, and c = 762. Since it’s a quadratic equation we get two answers:

\( x_{1} = -0.4556 \)
\( x_{2} = 0.7844 \)

The answer is in terms of a probability which is between 0 and 1, so we toss the negative answer and behold our maximum likelihood estimator for theta: 0.7844.

Remember that \( \theta = (1 – \pi)^{2}\). If we solve for pi, we get \( \pi = 1 – \theta^{1/2}\), which works out to be 0.1143. That is, we estimate the probability of characteristic genes switching chromosomes to be about 11%. Therefore we can think of theta as the probability of having two parents that experienced no gene switching.

Now point estimates are just estimates. We would like to know how good the estimate is. That’s where confidence intervals come in to play. Let’s calculate one for theta.

It turns out that we can estimate the variability of our maximum likelihood estimator as follows:

\( V(\theta) = \frac{1}{I(\theta)}\), where \( I(\theta) = -E(\frac{d^{2}\log L}{d \theta^{2}}) \)

We need to determine the second derivative of our log likelihood equation, take the expected value by plugging in our maximum likelihood estimator, multiply that by -1, and then take the reciprocal. The second derivative works out to be:

\( \frac{d^{2}\log L}{d \theta^{2}} = -\frac{1528}{(2 + \theta)^{2}} -\frac{223}{(1 – \theta)^{2}} -\frac{381}{\theta^{2}} \)

The negative expected value of the second derivative is obtained by plugging in our estimate of 0.7844 and multiplying by -1. Let’s head to the R console to calculate this:

th <- 0.7844 # our ML estimate
(I <- -1 * (-1528/((2+th)^2) - 223/((1-th)^2) - 381/(th^2))) # information
[1] 5613.731

Now take the reciprocal and we have our variance:

( <- 1/I)
[1] 0.0001781347

We can take the square root of the variance to get the standard error and multiply by 1.96 to get the margin of error for our estimate. Then add/subtract the margin of error to our estimate for a confidence interval:

# confidence limits on theta
0.784+(1.96*sqrt( # upper bound
[1] 0.8101596
0.784-(1.96*sqrt( # lower bound
[1] 0.7578404

Finally we convert the confidence interval for theta to a confidence interval for pi:

# probability of switching over
th.ub <- 0.784+(1.96*sqrt( <- 0.784-(1.96*sqrt(
1-sqrt(th.ub) # lower bound
[1] 0.09991136
1-sqrt( # upper bound
[1] 0.1294597

The probability of genes switching chromosomes is most probably in the range of 10% to 13%.

The standard deviation of the sampling distribution of the mean

Someone on Reddit posted Chapter 1 of Howard Wainer’s book, Picturing the Uncertain World. The name of the chapter is The Most Dangerous Equation and its subject is the formula for the standard deviation of the sampling distribution of the mean. This little guy:


Why is it dangerous? The danger comes from not understanding it. Wainer gives several examples of how ignorance of this equation has “led to billions of dollars of loss over centuries, yielding untold hardship.” But before diving into the examples he briefly describes the formula. It’s not a proof but rather a heuristic demonstration of how it works:

…if we measure, for example, the heights of, say, 1000 students at a particular high school, we might find that the average height is 67 inches, but heights might range from perhaps as little as 55 inches to as much as 80 inches. A number that characterizes this variation is the standard deviation….But now suppose we randomly grouped the 1000 children into 10 groups of 100 and calculated the average within each group. The variation of these 10 averages would likely be much smaller than [the standard deviation of all 1000 students]…

This got me to thinking about how I could demonstrate this in R. It seems like we could generate a population, then take repeated samples of size n to create a sampling distribution, and then show that the standard deviation of the sampling distribution is indeed equal to the population standard deviation divided by n. Let’s do it.

First I load the gtools package which has a function called permutations(). As you might guess, it takes a vector of values and generates all permutations of a given size. After that I generate a “population” of 10 values using the rnorm() function. So my 10 values come from a normal distribution with mean 10 and standard deviation 2, but I’m treating these 10 values as if they’re my entire population for this toy example.

# generate population of size 10
pop <- rnorm(10,10,2)

Now we're ready to generate the sampling distribution. I decided to let n = 3. This means we need to generate every possible sample of size 3. This is where the permutations() function comes. The first argument tells it how big the source vector is, the second states the size of the target vectors, and the third is the source vector. The last tells it to allow repeats. This is important. This replicates sampling with replacement, which is necessary to demonstrate this formula using a finite population.

# generate the sampling distribution
# first generate all samples of size 3
sdist <- permutations(10,3,pop,repeats=TRUE)

If you look at the variable sdist, you'll see it's a 1000 x 3 matrix. That's 1000 permutations of size 3 from our original population. Next we take the mean of each row (or sample of size 3):

# find the mean of all samples (in rows)
sdm <- apply(sdist,1,mean)

The variable "sdm" is our sampling distribution of the mean. We took every possible sample of size 3 from our population and calculated the mean of each sample. Now the first thing to note is that the mean of the sampling distribution is equal to the mean of our population:

mean(pop) == mean(sdm)
[1] TRUE

But what I really wanted to verify was the formula \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}, or equivalently \sigma_{\bar{x}}^{2} = \frac{\sigma^{2}}{n}:

sum(sdm^2)/1000 - (mean(sdm)^2) == (sum(pop^2)/10 - (mean(pop)^2))/3
[1] TRUE

Aha! It's true! Of course this isn't a proof, but it came out like we expected. Notice I had to manually find the population variance in each case using \sigma^{2} = E(X^{2}) - \mu^{2}. That's because the var() and sd() functions in R divide by n-1.

The moment generating function of the gamma

The moment-generating function of the gamma distribution is M(t) = (1-\theta t)^{-\alpha}. Such a friendly little guy. Not what you would expect when you start with this:

M(t) = \int_{0}^{\infty}e^{tx}\frac{1}{\Gamma (\alpha)\theta^{\alpha}}x^{\alpha - 1}e^{-x/\theta}dx

How do we get there? First let’s combine the two exponential terms and move the gamma fraction out of the integral:

M(t) = \frac{1}{\Gamma (\alpha)\theta^{\alpha}}\int_{0}^{\infty}x^{\alpha - 1}e^{tx-x/\theta}dx

Multiply tx in the exponential by \frac{\theta}{\theta} = 1, add the two terms together and factor out -x:

M(t) = \frac{1}{\Gamma (\alpha)\theta^{\alpha}}\int_{0}^{\infty}x^{\alpha - 1}e^{-x(1-\theta t)/\theta}dx

Now we’re ready to do so substitution in the integral. Let u = \frac{x(1-\theta t)}{\theta}. That means we have du = \frac{1-\theta t}{\theta}dx. Now solve those for x and dx, respectively. We get x = \frac{\theta u}{1-\theta t} and dx = \frac{\theta}{1-\theta t}du. Now substitute those in the integral to see something that looks truly awful:

M(t) = \frac{1}{\Gamma (\alpha)\theta^{\alpha}}\int_{0}^{\infty}\frac{\theta u}{1-\theta t}^{\alpha - 1}e^{-u}\frac{\theta}{1-\theta t}du

If re-write everything inside the integral with exponents we can do some cancellation and make this integral more attractive:

M(t) = \frac{1}{\Gamma (\alpha)\theta^{\alpha}}\int_{0}^{\infty}(1-\theta t)^{-\alpha + 1}(1-\theta t)^{-1}(\theta u)^{\alpha-1}\theta e^{-u}du

Using the rules of exponents we see that (1-\theta t)^{-\alpha + 1}(1-\theta t)^{-1} = (1- \theta t)^{-\alpha} which can be moved out of the integral since it doesn’t have u in it:

M(t) = \frac{(1- \theta t)^{-\alpha}}{\Gamma (\alpha)\theta^{\alpha}}\int_{0}^{\infty}(\theta u)^{\alpha-1}\theta e^{-u}du

Using the rules of exponents again we see that \theta^{\alpha -1}\theta = \theta^{\alpha}:

M(t) = \frac{(1- \theta t)^{-\alpha}}{\Gamma (\alpha)\theta^{\alpha}}\int_{0}^{\infty}\theta^{\alpha}u^{\alpha - 1} e^{-u}du

Now notice we can cancel out the two \theta^{\alpha} terms:

M(t) = \frac{(1- \theta t)^{-\alpha}}{\Gamma (\alpha)}\int_{0}^{\infty}u^{\alpha - 1} e^{-u}du

The integral is now the gamma function: \Gamma (\alpha) = \int_{0}^{\infty}u^{\alpha - 1} e^{-u}du. Make that substitution:

M(t) = \frac{(1- \theta t)^{-\alpha}}{\Gamma (\alpha)}\Gamma(\alpha)

Cancel out the \Gamma(\alpha) terms and we have our nice-looking moment-generating function:

M(t) = (1- \theta t)^{-\alpha}

If we take the derivative of this function and evaluate at 0 we get the mean of the gamma distribution:

M'(t) = -\alpha (1-\theta t)^{-\alpha -1}(-\theta) M'(0) = -\alpha(1 - 0)^{-\alpha -1}(-\theta) = \alpha\theta = E(X)

Recall that \theta is the mean time between events and \alpha is the number of events. Multiply them together and you have the mean. This makes sense. If the mean wait time between events is 2 minutes (\theta = 2) and we’re interested in the distribution of times until 4 events occur (\alpha = 4), it seems reasonable to think the mean wait time to observe the 4th event would be 8 minutes (4(2)).

If we take the second derivative of the moment-generating function and evaluate at 0, we get the second moment about the origin which we can use to find the variance:

M''(t) = (-\alpha - 1)\theta\alpha(1-\theta t)^{-\alpha - 2}(-\theta) M''(0) = (-\alpha - 1)\theta\alpha(1-0)^{-\alpha - 2}(-\theta) = (-\alpha - 1)(-\theta^{2})\alpha

Now find the variance:

Var(X) = (-\alpha - 1)(-\theta^{2})\alpha - (\alpha\theta)^{2} Var(X) = \alpha^{2}\theta^{2} + \alpha\theta^{2} - \alpha^{2}\theta^{2} Var(X) = \alpha\theta^{2}

Going back to our example with \alpha = 4 (number of events) and \theta=2 (mean time between events), we have as our variance 4(2^{2})=16. The square root of that gives our standard deviation as 4.

Plugging the mean and standard deviation into the dgamma function in R, we can plot this particular gamma distribution:

xval <- seq(0,30, by = 0.01)
ga.4 <- dgamma(xval, shape=4, scale=2)
plot(xval,ga.4,col=1, type = "l", ylab="f(x)", xlab="x")
title(main="Gamma Distribution, scale = 2, shape = 4")

gamma with scale = 2 and shape = 4

Why the moment-generating function works

In a previous post, I described how a moment-generating function does what its name says it does. Still, though, it’s not clear why it works. I mean, look at the formula:


That’s the discrete case. In the continuous case we have:


So you take the usual expected value formula for either discrete or continuous cases, stick e^{tx} in there, and you get a function that generates moments. I don’t know about you, but that doesn’t make me pop up out of my chair and exclaim, “oh, of course, I see it! How obvious.” Fortunately, it’s not beyond comprehension. To see why it does what it does, let’s take the derivative of the moment-generating function in the discrete case, evaluate it at 0, and see what happens:

\frac{d}{dt}M(t)=\frac{d}{dt}\sum_{x}e^{tx}f(x) =\sum_{x}\frac{d}{dt}e^{tx}f(x) =\sum_{x}e^{tx}xf(x) =EXe^{tX}

Now evaluate the derivative at 0:

\frac{d}{dt}M(0)=EXe^{0X} = EX

So evaluating the first derivative of the moment-generating function at 0 gives us EX, the first moment about the origin, also known as the mean.

Let’s evaluate the second derivative of the moment-generating function at 0 and see what happens:

\frac{d^{2}}{dt^{2}}M(t)=\frac{d^{2}}{dt^{2}}\sum_{x}e^{tx}f(x) =\sum_{x}\frac{d}{dt}xe^{tx}f(x) =\sum_{x}e^{tx}(0)+xe^{tx}xf(x) =\sum_{x}x^{2}e^{tx}f(x) =EX^{2}e^{tX}

Now evaluate the derivative at 0:

\frac{d^{}2}{dt^{2}}M(0)=EX^{2}e^{0X} = EX^{2}

That gives us the second moment about the origin, which comes in handy when finding variance, since variance is equal to Var(x) = EX^{2}-(EX)^{2}.

This will keep happening for the third, fourth, fifth and every derivative thereafter, provided the moment-generating function exists. And lest you think that’s just some technicality you never have to worry about, you should know the venerable t and F distributions do not have moment-generating functions. It turns out some of their moments are infinite. So there you go.

The moment-generating function

The moment-generating function (or mgf) is probably better understood if you simply focus on its name rather than its formula. It’s a function that generates moments. Of course it helps to know what “moments” are. A moment is a term from mechanics for the product of a distance and its weight. If you have several distances and associated weights, you can calculate all the products and sum them and get what you call the “first moment about the origin.” If that’s a little too abstract, just know that the calculation of the “first moment about the origin” is the same calculation one uses to find the mean of a random variable. Thus the mean of a random variable is the same as the first moment about the origin. Therefore we can use the mgf to generate the first moment about the origin, and thus find the mean of a random variable. But the moment-generating function doesn’t stop there. It also generates second moments, third moments, etc., provided the mgf exists (a detail I’m not going to address in this post). Since the variance of a random variable can be calculated using the first and second moments about the origin, we can use the mgf to find variance as well as the mean.

So that’s the big idea: the moment-generating function provides another (sometimes easier) way to find the mean and variance of a random variable. That’s why it’s taught in upper-level statistics (even though many instructors dive into this topic without providing any motivation for it). Recall the usual formulas for finding the mean and variance: \mu = \sum_{x\in S}xf(x) and \sigma^{2} = \sum_{x\in S}(x-\mu)^{2}f(x) . These can lead to some awfully long and hard math, especially for finding the variance. Even the variance “shortcut” \sigma^{2} =E[X^{2}]-\mu^{2} can get bogged down in finding E[X^{2}] = \sum_{x\in S}x^{2}f(x). This is where the moment-generating function can be of service. Let’s introduce it (for discrete random variables):

mgf: M(t) = E(e^{tX}) = \sum_{x\in S}e^{tX}f(x)

I’ll admit it doesn’t look very friendly. And it’s certainly not obvious that it would help you find the mean and variance of a distribution. Why it works is a topic for another post, a post I’ll almost certainly never write. But seriously, why the mgf does what it does is explained rather well in one of my favorite Dover books, Principles of Statistics by M.G. Bulmer. A highly recommended book at only $15. But say you believe me and take it on faith that the mgf can help us find the mean and variance. How does it work? Well, here are the steps you usually follow:

  1. find the mgf
  2. find the first derivative of the mgf
  3. find the second derivative of the mgf
  4. solve the first derivative of the mgf for t=0. That gives you the mean.
  5. solve the second derivative of the mgf for t=0. That gives you E[X^{2}]
  6. subtract the mean squared from E[X^{2}] (ie, use the formula \sigma^{2} =E[X^{2}]-\mu^{2})

Let’s do a simple example. Say we have a probability mass function (pmf) f(x) = (4 - x) / 6 with x = 1, 2, 3. (You can verify it’s a pmf by calculating f(1), f(2) and f(3), and adding the results. They sum to 1.) Use the moment-generating function to find the mean and variance.

Following the steps above we first find the mgf:

M(t) = E(e^{tX}) =e^{1t}((4-1)/6)+e^{2t}((4-2)/6)+e^{3t}((4-3)/6) M(t) = \frac{3}{6}e^{t}+\frac{2}{6}e^{2t}+\frac{1}{6}e^{3t}

Next find the first and second derivatives:

M^{\prime}(t) = \frac{3}{6}e^{t}+\frac{4}{6}e^{2t}+\frac{3}{6}e^{3t} M^{\prime\prime}(t) = \frac{3}{6}e^{t}+\frac{8}{6}e^{2t}+\frac{9}{6}e^{3t}

Now solve M^{\prime}(0) to find the mean:

\mu = M^{\prime}(0) = \frac{3}{6}e^{0}+\frac{4}{6}e^{0}+\frac{3}{6}e^{0} = \frac{3}{6}+\frac{4}{6}+\frac{3}{6} = \frac{10}{6} \approxeq 1.667

Finally solve M^{\prime\prime}(0) = E[X^{2}] to find the variance:

M^{\prime\prime}(0) = \frac{3}{6}e^{0}+\frac{8}{6}e^{0}+\frac{9}{6}e^{0}=\frac{3}{6}+\frac{8}{6}+\frac{9}{6} = \frac{20}{6} \sigma^{2} = M^{\prime\prime}(0) - \mu^{2} = \frac{20}{6} - \frac{10}{6}^{2} = \frac{5}{9} \approxeq 0.556

Of course you wouldn’t normally employ the mgf for a simple distribution like this, but I thought it was a nice way to illustrate the concepts of the moment-generating function. By the way, I took this example from the textbook Probability and Statistical Inference (7th ed) by Hogg and Tanis (p. 101). Another highly recommended book.