Tag Archives: Bulmer

A Probability Problem in Heredity – Part 3

In my previous two posts I showed worked solutions to problems 2.5 and 11.7 in Bulmer’s Principles of Statistics, both of which involve the characteristics of self-fertilizing hybrid sweet peas. It turns out that problem 11.8 also involves this same topic, so why not work it as well for completeness. The problem asks us to assume that we were unable to find an explicit solution for the maximum likelihood equation in problem 11.7 and to solve it by using the following iterative method:

\( \theta_{1} = \theta_{0} + \frac{S(\theta_{0})}{I(\theta_{0})} \)

where \( S(\theta_{0}) \) is the value of \( \frac{d \log L}{d\theta}\) evaluated at \( \theta_{0}\) and \( I(\theta_{0})\) is the value of \( -E(\frac{d^{2}\log L}{d\theta^{2}})\) evaluated at \( \theta_{0}\).

So we begin with \( \theta_{0}\) and the iterative method returns \( \theta_{1}\). Now we run the iterative method again starting with \( \theta_{1}\) and get \( \theta_{2}\):

\( \theta_{2} = \theta_{1} + \frac{S(\theta_{1})}{I(\theta_{1})} \)

We repeat this process until we converge upon a value. This is called the Newton-Raphson method. Naturally this is something we would like to have the computer do for us.

First, recall our formulas from problem 11.7:

\( \frac{d \log L}{d\theta} = \frac{1528}{2 + \theta} – \frac{223}{1 – \theta} + \frac{381}{\theta} \)
\( \frac{d^{2}\log L}{d \theta^{2}} = -\frac{1528}{(2 + \theta)^{2}} -\frac{223}{(1 – \theta)^{2}} -\frac{381}{\theta^{2}} \)

Let’s write functions for those in R:

# maximum likelihood score
mls <- function(x) {
	1528/(2 + x) - 223/(1 - x) + 381/x
	}
# the information
inf <- function(x) {
	-1528/((2 + x)^2) - 223/((1 - x)^2) - 381/(x^2)
	}

Now we can use those functions in another function that will run the iterative method starting at a trial value:

# newton-raphson using expected information matrix
nr <- function(th) {
 prev <- th
 repeat {
   new <- prev + mls(prev)/-inf(prev)
   if(abs(prev - new)/abs(new) <0.0001)
     break
   prev <- new
  }
new
}	

This function first takes its argument and names it "prev". Then it starts a repeating loop. The first thing the loop does it calculate the new value using the iterative formula. It then checks to see if the difference between the new and previous value - divided by the new value - is less than 0.0001. If it is, the loop breaks and the "new" value is printed to the console. If not, the loop repeats. Notice that each iteration is hopefully converging on a value. As it converges, the difference between the "prev" and "new" value will get smaller and smaller. So small that dividing the difference by the "new" value (or "prev" value for that matter) will begin to approach 0.

To run this function, we simply call it from the console. Let's start with a value of \( \theta_{0} = \frac{1}{4}\), as the problem suggests:

nr(1/4)
[1] 0.7844304

There you go! We could make the function tell us a little more by outputting the iterative values and number of iterations. Here's a super quick and dirty way to do that:

# newton-raphson using expected information matrix
nr <- function(th) {
 k <- 1 # number of iterations
 v <- c() # iterative values
  prev <- th
  repeat {
    new <- prev + mls(prev)/-inf(prev)
    v[k] <- new
    if(abs(prev - new)/abs(new) <0.0001)
     break
    prev <- new
    k <- k + 1
    }
print(new) # the value we converged on
print(v) # the iterative values
print(k) # number of iterations
}

Now when we run the function we get this:

nr(1/4)
[1] 0.7844304
[1] 0.5304977 0.8557780 0.8062570 0.7863259 0.7844441 0.7844304
[1] 6

We see it took 6 iterations to converge. And with that I think I've had my fill of heredity problems for a while.

A Probability Problem in Heredity – Part 2

In my last post I solved a problem from chapter 2 of M.G. Bulmer’s Principles of Statistics. In this post I work through a problem in chapter 11 that is basically a continuation of the chapter 2 problem. If you take a look at the previous post, you’ll notice we were asked to find probability in terms of theta. I did it and that’s nice and all, but we can go further. We actually have data, so we can estimate theta. And that’s what the problem in chapter 11 asks us to do. If you’re wondering why it took 9 chapters to get from finding theta to estimating theta, that’s because the first problem involved basic probability and this one requires maximum likelihood. It’s a bit of a jump where statistical background is concerned.

The results of the last post were as follows:

purple-flowered red-flowered
long pollen \( \frac{1}{4}(\theta + 2)\) \( \frac{1}{4}(1 – \theta) \)
round pollen \( \frac{1}{4}(1 – \theta) \) \( \frac{1}{4}\theta \)

The table provides probabilities of four possible phenotypes when hybrid sweet peas are allowed to self-fertilize. For example, the probability of a self-fertilizing sweet pea producing a purple-flower with long pollen is \( \frac{1}{4}(1 – \theta)\). In this post we’ll estimate theta from our data. Recall that \( \theta = (1 – \pi)^{2} \), where \( \pi \) is the probability of the dominant and recessive genes of a characteristic switching chromosomes.

Here’s the data:

Purple-flowered Red-flowered
Long pollen 1528 117
Round pollen 106 381

We see from the table there are 4 exclusive possibilities when the sweet pea self-fertilizes. If we think of each possibility having its own probability of occurrence, then we can think of this data as a sample from a multinomial distribution. Since chapter 11 covers maximum likelihood estimation, the problem therefore asks us to use the multinomial likelihood function to estimate theta.

Now the maximum likelihood estimator for each probability is \( \hat{p_{i}} = \frac{x_{i}}{n} \). But we can’t use that. That’s estimating four parameters. We need to estimate one parameter, theta. So we need to go back to the multinomial maximum likelihood function and define \( p_{i}\) in terms of theta. And of course we’ll work with the log likelihood since it’s easier to work with sums than products.

\( \log L(\theta) = y_{1} \log p_{1} + y_{2} \log p_{2} + y_{3} \log p_{3} + y_{4} \log p_{4} \)

If you’re not sure how I got that, google “log likelihood multinomial distribution” for more PDF lecture notes than you can ever hope to read.

Now let’s define the probabilities in terms of one parameter, theta:

\( \log L(\theta) = y_{1} \log f_{1}(\theta) + y_{2} \log f_{2}(\theta) + y_{3} \log f_{3}(\theta) + y_{4} \log f_{4}(\theta) \)

Now take the derivative. Once we have that we can set equal to 0 and find a solution for theta. The solution will be the point at which theta obtains its maximum value:

\( \frac{d \log L(\theta)}{d\theta} = \frac{y_{1}}{f_{1}(\theta)} f’_{1}(\theta) + \frac{y_{1}}{f_{2}(\theta)} f’_{2}(\theta) + \frac{y_{1}}{f_{3}(\theta)} f’_{3}(\theta) + \frac{y_{1}}{f_{4}(\theta)} f’_{4}(\theta)\)

Time to go from the abstract to the applied with our values. The y’s are the data from our table and the functions of theta are the results from the previous problem.

\( \frac{d \log L(\theta)}{d\theta} = \frac{1528}{1/4(2 + \theta)} \frac{1}{4} – \frac{117}{1/4(1 – \theta)}\frac{1}{4} – \frac{106}{1/4(1 – \theta)}\frac{1}{4} + \frac{381}{1/4(\theta)} \frac{1}{4} \)
\( \frac{d \log L(\theta)}{d\theta} = \frac{1528}{2 + \theta} – \frac{117}{1 – \theta} – \frac{106}{1 – \theta} + \frac{381}{\theta} \)
\( \frac{d \log L(\theta)}{d\theta} = \frac{1528}{2 + \theta} – \frac{223}{1 – \theta} + \frac{381}{\theta} \)

Set equal to 0 and solve for theta. Beware, it gets messy.

\( \frac{[1528(1 – \theta)\theta] – [223(2 + \theta)\theta] + [381(2 + \theta)(1 – \theta)]}{(2 + \theta)(1- \theta)\theta} = 0\)

Yeesh. Fortunately the denominator cancels out when we start multiplying terms and solving for theta. So we’re left with this:

\( 1528\theta – 1528\theta^{2} – 446\theta – 223\theta^{2} + 762 – 381\theta – 381\theta^{2} = 0\)

And that reduces to the following quadratic equation:

\( -2132\theta^{2} + 701\theta + 762 = 0\)

I propose we use an online calculator to solve this equation. Here’s a good one. Just plug in the coefficients and hit solve to find the roots. Our coefficients are a = -2132, b = 701, and c = 762. Since it’s a quadratic equation we get two answers:

\( x_{1} = -0.4556 \)
\( x_{2} = 0.7844 \)

The answer is in terms of a probability which is between 0 and 1, so we toss the negative answer and behold our maximum likelihood estimator for theta: 0.7844.

Remember that \( \theta = (1 – \pi)^{2}\). If we solve for pi, we get \( \pi = 1 – \theta^{1/2}\), which works out to be 0.1143. That is, we estimate the probability of characteristic genes switching chromosomes to be about 11%. Therefore we can think of theta as the probability of having two parents that experienced no gene switching.

Now point estimates are just estimates. We would like to know how good the estimate is. That’s where confidence intervals come in to play. Let’s calculate one for theta.

It turns out that we can estimate the variability of our maximum likelihood estimator as follows:

\( V(\theta) = \frac{1}{I(\theta)}\), where \( I(\theta) = -E(\frac{d^{2}\log L}{d \theta^{2}}) \)

We need to determine the second derivative of our log likelihood equation, take the expected value by plugging in our maximum likelihood estimator, multiply that by -1, and then take the reciprocal. The second derivative works out to be:

\( \frac{d^{2}\log L}{d \theta^{2}} = -\frac{1528}{(2 + \theta)^{2}} -\frac{223}{(1 – \theta)^{2}} -\frac{381}{\theta^{2}} \)

The negative expected value of the second derivative is obtained by plugging in our estimate of 0.7844 and multiplying by -1. Let’s head to the R console to calculate this:

th <- 0.7844 # our ML estimate
(I <- -1 * (-1528/((2+th)^2) - 223/((1-th)^2) - 381/(th^2))) # information
[1] 5613.731

Now take the reciprocal and we have our variance:

(v.th <- 1/I)
[1] 0.0001781347

We can take the square root of the variance to get the standard error and multiply by 1.96 to get the margin of error for our estimate. Then add/subtract the margin of error to our estimate for a confidence interval:

# confidence limits on theta
0.784+(1.96*sqrt(v.th)) # upper bound
[1] 0.8101596
0.784-(1.96*sqrt(v.th)) # lower bound
[1] 0.7578404

Finally we convert the confidence interval for theta to a confidence interval for pi:

# probability of switching over
th.ub <- 0.784+(1.96*sqrt(v.th))
th.lb <- 0.784-(1.96*sqrt(v.th))
1-sqrt(th.ub) # lower bound
[1] 0.09991136
1-sqrt(th.lb) # upper bound
[1] 0.1294597

The probability of genes switching chromosomes is most probably in the range of 10% to 13%.

A Probability Problem in Heredity

Here’s a fun problem in heredity from the Dover classic Principles of Statistics by M.G. Bulmer. It’s from chapter 2, problem 2.5.

The results of Table 7 on p. 25 (see below) can be explained on the assumption that the genes for flower colour and pollen shape are on the same chromosome but that there is a probability π that one of the genes will be exchanged for the corresponding gene on the other chromosome. If we denote the genes for purple or red flowers by P and p, and the genes for long and round pollen by L and l, then the hybrids from the cross considered will all be of the genotype PL/pl, the notation indicating that the P and L genes are on one chromosome and the p and l genes on the other. When these hybrids are allowed to self-fertilise, there is a chance π that the L and l genes will interchange in one parent, giving Pl/pL; there are therefore really three mating types, PL/pl X PL/pl, Pl/pL X PL/pl and Pl/pL x Pl/pL, which occur with probabilities \( (1 – \pi)^{2} \), \( 2\pi(1 – \pi) \) and \(\pi^{2}\) respectively. Find the probabilities of the four possible phenotypes resulting from the experiment in terms of \( \theta = (1 – \pi)^{2} \).

Here’s the table the problem refers to:

Purple-flowered Red-flowered Total
Long pollen 1528 117 1645
Round pollen 106 381 487
Total 1634 498 2132

The interesting thing about that table is that it violates Mendel’s law of independent assortment. If it obeyed that law, then the probabilities of the resulting phenotypes would be:

Purple-flowered Red-flowered
Long pollen \( \frac{9}{16} \) \( \frac{3}{16} \)
Round pollen \( \frac{3}{16} \) \( \frac{1}{16} \)

We’d expect the purple/long pollen flower to happen about 9/16 = 56% of the time. Instead we see it occurring about 1528/2132 = 72% of the time. As the problem explains this has to do with the way genes are carried on chromosomes. This means we can’t calculate probabilities as you normally would for a dihybrid cross. Therefore it asks us to calculate those probabilities conditional on whether or not the gene for pollen switched chromosomes. Our answer will take the form above, but instead of actual numbers, we’ll express our answer in terms of \( \theta = (1 – \pi)^{2} \). In other words theta is the probability that the gene for pollen did not switch chromosomes.

The hard part of this problem is setting it up. First, we have to recognize that we don’t know which chromosome has the characteristics. The characteristics of mating type PL/pl X PL/pl could both be on the PL chromosomes, in which case the result would be a PL, or a purple/long pollen flower. Or one could be PL and the other pl, which would still be PL, a purple/long pollen flower, since purple and long pollen are dominant characteristics. If both are on the pl chromosome, then the result would be pl, a red/round pollen flower. All told, for the PL/pl X PL/pl mating type we have the following possibilities:

PL/pl X PL/pl mating table (p = \( (1 – \pi)^{2}\) )
PL pl
PL PL PL
pl PL pl

For this mating type we get a purple/long pollen flower (PL) three out of four times and a red/round pollen flower (pl) one out of four times. We need to construct similar tables for the other two mating types. But notice the Pl/pL X PL/pl mating type can actually happen in reverse order as PL/pl X Pl/pL, so it needs two tables. Therefore we really need three more tables:

Pl/pL X PL/pl mating table (p = \( \pi(1 – \pi)\) )
PL pl
Pl PL Pl
pL PL pL
PL/pl X Pl/pL mating table (p = \( \pi(1 – \pi)\) )
Pl pL
PL PL PL
pl Pl pL
Pl/pL X Pl/pL mating table (p = \( \pi^{2}\) )
Pl pL
Pl Pl PL
pL PL pL

Have to give a quick shout out to http://truben.no/latex/table/ for helping me make those tables! Anyway…so we have 4 tables displaying 16 possible results:

  • 9 out of 16 times you get PL, a purple/long pollen flower
  • 3 out of 16 times you get Pl, a purple/round pollen flower
  • 3 out of 16 times you get pL, a red/long pollen flower
  • 1 out of 16 times you get pl, a red/round pollen flower

If the results were independent, we could just call those probabilities. But they’re not. That’s the whole point of the problem. We have to take into account the probability π of the exchange of genes from one chromosome to the other. Let’s reset the probabilities for the four mating types:

  1. PL/pl X PL/pl = \( (1-\pi)^{2} \)
  2. Pl/pL X PL/pl = \( \pi(1-\pi) \)
  3. PL/pl X pl/pL = \( \pi(1-\pi) \)
  4. Pl/pL X Pl/pL = \( \pi^{2} \)

So the probability of getting pl in the first mating type is \( \frac{1}{4}(1-\pi)^{2} \). Recall the problem asks us to find these probabilities in terms of \( \theta = (1 – \pi)^{2} \), so we can express this as \( \frac{1}{4}\theta \). And there’s one of our answers since pl does not occur in any of the other mating types.

Now we just need to find the other probabilities. To make life easier, let’s go ahead and convert all the probabilities in terms of \( \theta \):

  • \( (1 – \pi)^2 = \theta \)
  • \( (1 – \pi) = \theta^{1/2} \)
  • \( \pi = 1 – \theta^{1/2} \)
  • \( \pi^{2} = (1- \theta^{1/2})^{2} \)

The hardest one to find is PL:

\( PL = \frac{3}{4}\theta + \frac{2}{4}(1 – \theta^{1/2})\theta^{1/2} + \frac{2}{4}(1 – \theta^{1/2})\theta^{1/2} + \frac{2}{4}(1 – \theta^{1/2})^{2} \)
\( PL = \frac{3}{4}\theta + \frac{2}{4}[2(1 – \theta^{1/2})\theta^{1/2} + 1 – \theta^{1/2} – \theta^{1/2} + \theta] \)
\( PL = \frac{3}{4}\theta + \frac{2}{4}[2(\theta^{1/2} – \theta) + 1 – 2\theta^{1/2} + \theta] \)
\( PL = \frac{3}{4}\theta + \frac{2}{4}[2\theta^{1/2} – 2\theta + 1 – 2\theta^{1/2} + \theta] \)
\( PL = \frac{3}{4}\theta + \frac{2}{4}(1 – \theta) \)
\( PL = \frac{3}{4}\theta + \frac{2}{4} – \frac{2}{4}\theta \)
\( PL = \frac{1}{4}\theta + \frac{2}{4} \)
\( PL = \frac{1}{4}(\theta + 2)\)

Next up is pL:

\( pL = \frac{1}{4}(1 – \theta^{1/2})\theta^{1/2} + \frac{1}{4}(1 – \theta^{1/2})\theta^{1/2} + \frac{1}{4}(1 – \theta^{1/2})^{2} \)
\( pL = \frac{1}{4}(\theta^{1/2} – \theta + \theta^{1/2} – \theta + 1 – \theta^{1/2} – \theta^{1/2} + \theta) \)
\( pL = \frac{1}{4}(-2\theta + 1 + \theta) \)
\( pL = \frac{1}{4}(1 – \theta) \)

That just leaves Pl. But if you look at the tables above, you’ll notice Pl appears in the same tables with pL the same number of times. So if we set up an equation to find its probability, we’ll get the same equation we started with when we solved Pl. That means we’ll get the same answer. So we don’t need to solve it. We already know it: \( pL = \frac{1}{4}(1 – \theta) \)

And that finishes the problem. The probabilities of the four possible phenotypes are as follows:

purple-flowered red-flowered
long pollen \( \frac{1}{4}(\theta + 2) \) \( \frac{1}{4}(1 – \theta) \)
round pollen \( \frac{1}{4}(1 – \theta) \) \( \frac{1}{4}\theta \)