Category Archives: Probability

Understanding qq-plots

qq-plot is short for quantile-by-quantile plot. You plot one quantile against another and you see if their coordinate pairs form a straight line. One of the quantiles is your sample observations placed in ascending order. For example, you take the height of 30 people and place them in order from smallest to largest. That will be your y-axis values. The other quantile will be the quantiles of a probability distribution that you believe your sample may have come from. For example, you may think your heights are Normally distributed. So your x-axis values would be the quantiles of a standard Normal distribution. Since you have 30 heights, you’ll need 30 Normal quantiles. The way that’s done is to find the quantiles for \frac{1}{31} , \frac{2}{31} ,\dots, \frac{30}{31} . In other words you solve for z in P(Z \le z) = \frac{1}{30}. That turns out to be -1.85. And then you do the same for all the fractions. Finally you plot your heights versus the z-values and see if they’re in a somewhat straight line. If they are, then you can be pretty safe in your assumption that your sample comes from a Normal distribution.

For Normal distributions this can be made intuitive by recalling the standard Normal transformation: z = \frac{x-\mu}{\sigma}. That’s the formula we use to create z-scores, which allow us to use the standard normal distribution to find probabilities. Solving for x we get x = z\sigma + \mu. You may notice that looks just like the formula for a straight line but with different variables. Remember this one from algebra? y = mx + b. That’s what we have but with the mean (\mu) serving as the intercept and the standard deviation (\sigma) as the slope. And instead of y and x, we have x and z. If we wanted to make a straight line, we would plug in our z-values into x = z\sigma + \mu and find our x values. That would give us a set of coordinate pairs that we could plot and then connect the dots to make a straight line.

Now of course we don’t know \mu and \sigma. We have to estimate them from our sample. Furthermore we don’t know if our sample came from a Normal distribution. But when constructing a Normal qq-plot, we assume our sample did come from a Normal distribution. We have our straight line formula x = z\sigma + \mu and our x values. If our x values came from a Normal distribution, then they would have corresponding z-scores. We don’t know what they are (because we don’t know \mu and \sigma), but we can estimate them with the quantiles of a standard normal distribution, with the number of quantiles equal to our sample size. Let’s do this in R. We’ll generate a random sample of 30 observations from a Normal distribution with mean = 10 and standard deviation = 2.

# generate random sample from N(10,2)
x <- sort(rnorm(30,10,2))
k <- (1:30)/31
z <- qnorm(k)
plot(z,x,main='Normal qq-plot')

That gives us the following plot:

The points look like they're on a straight line, as they should. R has a built in function for this called qqnorm. All you have to do is type:


And you get:

Which looks like what we did the first time but with axis labels.

If we want we can draw a straight line through the plot to help us gauge just how close to a straight line the points lie. One choice is to draw the least-squares line. Another is to use the formula x = z\sigma + \mu using our mean and standard deviation estimates for \mu and \sigma . In other words plot the line x = zs + \overline{x}. Let's do both:

# create qq-plot with straight lines
plot(z,x,main='Normal qq-plot')
abline(lm(x~z)) # draw best fitting line
plot(z,x,main='Normal qq-plot')
abline(a=mean(x),b=sd(x)) # draw line with slope = std dev and intercept = x-bar

Here's the plot:

It's really hard to tell the difference between the two, but the points seem to lie pretty close to both straight lines.

So that's how Normal qq-plots work. But keep in mind that qq-plots are not just for assessing Normality. You can assess whether or not a sample came from, say, a chi-square distribution or a gamma distribution. Let's generate 30 points from a chi-square distribution with 4 degrees of freedom, and then construct two qq-plots: one with chi-square quantiles and the other with Normal quantiles.

# generate random sample from chi-sq(df=4)
c <- sort(rchisq(30,df=4))
# compare normal and chi-square qq-plots
k <- (1:30)/31
q <- qnorm(k)
cq <- qchisq(k,df=4)
plot(cq,c,main='chi-square qq-plot')
abline(lm(c~cq)) # draw best fitting line
plot(q,c,main='Normal qq-plot')
abline(lm(c~q)) # draw best fitting line

This gives us the following plots:

We see in the left plot our two quantiles lying pretty close to a straight line. They should. That's the chi-square qq-plot and our data are from a chi-square distribution. On the right however we see a fit that's not so good, especially in the extremes. That's our Normal qq-plot. Since our data come from a chi-square distribution, which is skewed right, it makes sense that the Normal qq-plot would show large deviations from a straight line in the "tails" of the plot.

Independence vs. Mutually Exclusive

I used to tutor undergrad students in statistics when I was in grad school. One question that almost every student asked me was to explain the difference between “independence” and “mutually exclusive”. Of course there’s the probabilistic definitions:

If P(A \cap B) = P(A)P(B), then the events A and B are independent.

If A \cap B = \emptyset then the events A and B are mutually exclusive.

But that can be a little too abstract for a new statistics student. Here’s what I would tell them.

When we talk about independence, we’re talking about a series of events. The classic example is a series of coin flips. The outcome of the first flip (or first 10,000 flips) has no effect on the probability of the next flip. The flips are independent. The probability does not change from flip to flip. The same concept can be applied to answers to a particular question on a survey taken by a 100 people, or a vital sign taken from each patient in a clinical trial treatment group, or weight measurements of a random sample of candy bars rolling off an automated production line. We can think of all those as being independent measurements. Because someone in Chicago said “yes” to a survey question doesn’t change the probability that someone in Dallas will also respond “yes”. Again, we’re talking about a series of events and whether or not the probability of the event outcomes change based on earlier event outcomes. The counter-example to independence is drawing cards. The probability of drawing, say, a King changes from draw to draw if you do not replace the cards. If your first draw is a 10, you have a slightly better chance of drawing a King on the next draw because of the size of the deck decreased by 1 card.

When we talk about events being mutually exclusive, we’re talking about observing one outcome and whether or not some or all of the events can occur at the same time. If I flip a coin once, I will either observe a head or tail, not both. The events are mutually exclusive for that outcome. Same with rolling dice. One roll and I will observe one side. I can’t observe both a 2 and 5 on one roll. The events are mutually exclusive. A baby is born. It will either be a boy or a girl. The events are mutually exclusive. I take a sample of blood from someone to determine the blood type. It will be one type, say O-negative. It can’t be two types.

So that’s the main idea for a high-level distinction between the two:

  1. Independence deals with probability of outcomes over a series of events.
  2. Mutually exclusive deals with the possibility of events in a single outcome.

Now the ultimate gotcha question is can a pair of events be simultaneously mutually exclusive and independent? In other words, if two events are possible (ie, they have a probability greater than 0), then can they both be mutually exclusive and independent? The answer is No.

If A and B are mutually exclusive, then A \cap B = \emptyset, which implies P(A \cap B) = 0. If A and B are independent, then P(A \cap B) = 0, but both A and B have positive probability so that can’t be. So A and B are not independent.

Likewise, if A and B are independent, we have P(A \cap B) = P(A)P(B) > 0 since both events are possible. But as we saw before, if two events are mutually exclusive then P(A \cap B) = 0. So A and B can’t be mutually exclusive if they’re independent.


The forgetful exponential distribution

The exponential distribution has the quirky property of having no memory. Before we wade into the math and see why, let’s consider a situation where there is memory: drawing cards. Let’s say you have a well-shuffled deck of 52 cards and you draw a single card. What’s the probability of drawing an ace? Since there are 4 aces in a deck of 52 cards, the probability is \frac{4}{52}. We draw our card and it’s not an ace. We set the card aside, away from the deck, and draw again. Now our probability of drawing an ace is \frac{4}{51}. We have a slightly better chance on the 2nd draw. The condition that we have already selected a card that wasn’t an ace changes the probability we draw an ace. This doesn’t happen with the exponential distribution.

Let’s say we have a state-of-the-art widget (version 2.0) that has a lifespan that can be described with an exponential distribution. Further, let’s say the mean lifespan is 60 months, or 5 years. Thanks to the “no memory” property, the probability of the lifespan lasting 7 years is that same whether the widget is new or 5 years old. In math words:

P(X > 7 + 5 | X>5) = P(X>7)

That means if I bought a widget that was 5 years old, it has the same probability of lasting another 7 years as a brand new widget has for lasting 7 years. Not realistic but certainly interesting. Showing why this is the case is actually pretty straight-ahead.

We want to show that for the exponential distribution, P(X > x + y | X > x) = P(X > y).

Recall the cumulative distribution of an exponential distribution is P(X \le x)=F(x) = 1 -e^{-x/\theta}. That’s the probability of an event occurring before a certain time x. The complement of the cumulative distribution is the probability of an event occurring after a certain time:

P(X > x) = 1 - P(X \le x) = 1 - (1 - e^{-x/ \theta} ) = e^{-x/ \theta}

Also recall the definition of conditional probability: P(A |B) = \frac{P(A \cap B)}{P(B)}

Let’s plug into the equality we want to prove and see what happens:

P(X > x + y | X > x) = \frac{P(X>x+y) \cap P(X>x)}{P(X > x)} = \frac{P(X>x+y)}{P(X > x)} =\frac{e^{-(x+y)/\theta}}{e^{-x/\theta}} = \frac{e^{-x/\theta}e^{-y/\theta}}{e^{-x/\theta}} = e^{-y/\theta} = P(X>y)

There you go. Not too bad.

We can actually go the other direction as well. That is, we can show that if P(X > x + y | X > x) = P(X > y) is true for a continuous random variable X, then X has an exponential distribution. Here’s how:

P(X > x + y | X > x) = P(X > y) (given)

P(1 - F(x + y) | 1 - F(x)) = 1 - F(y) (substitute the cdf expressions)

\frac{1-F(x+y) \cap 1-F(x))}{1-F(x)}=1-F(y) (using the definition of conditional probability)

\frac{1-F(x+y)}{1-F(x)}=1-F(y) (If X > x + y, then X > x)

Now substitute in generic function terminology, say h(x) = 1 - F(x):


Rearranging terms gives us h(x+y)=h(y)h(x)

Now for that equality to hold, the function h(x) has to have an exponential form, where the variable is in the exponent, like this: a^{x}. Recall that a^{x}a^{y}=a^{x+y}. If h(x) = a^{x}, then our equality above works. So we let h(x)=a^{x}. That allows to make the following conclusion:

1-F(x) = h(x) = a^{x} = e^{ln a^{x}} = e^{x ln a}

Now let b = ln a. We get 1-F(x) = e^{bx}. Solving for F(x) we get F(x) = 1 - e^{bx}. Since F(\infty) = 1, b must be negative. So we have F(x) = 1 - e^{-bx}. Now we just let b = \frac{1}{\theta} and we have the cumulative distribution function for an exponential distribution: F(x) = 1 - e^{-x/\theta}.

That’s the memoryless property for you. Or maybe it’s called the forgetfulness property. I can’t remember.

Investigating the chi-square test of independence

Survey a random sample of 500 men and ask if they have seen Star Wars more than twice. Survey 500 women and ask the same question. We might guess that the proportion of people in each gender who say “yes” will differ significantly. In other words, the likelihood of answering the question one way or another depends on the gender. What if we surveyed a random sample of 500 men and 500 women and asked whether or not their birthday was in December? We probably wouldn’t expect gender to influence the likelihood of having a birthday in December.

In my first example, the two variables are…

  1. Have you seen Star Wars more than twice? (Yes/No)
  2. Gender (Male/Female)

In the second example the two variables are…

  1. Is your birthday in December? (Yes/No)
  2. Gender (Male/Female)

Our question is whether or not these variables are associated.The quick way to answer this question is the chi-square test of independence. There are many good web pages that explain how to do this test and interpret the results, so I won’t re-invent the iPhone. Instead, what I want to demonstrate is why it’s called the “chi-square” test of independence.

Let’s look at the statistic for this test:


k = the number of combinations possible between the two variables. For example,

  1. Male & seen Star Wars more than twice
  2. Male & have not seen Star Wars more than twice
  3. Female & seen Star Wars more than twice
  4. Female & have not seen Star Wars more than twice

O = the OBSERVED number of occurrences for each combination. Perhaps in my survey I found 12 females out of 500 who have seen Star Wars more than twice and 488 who have not. By contrast, I found 451 males who have seen it twice and 49 who have not.

E = the EXPECTED number of occurrences IF the variables were NOT associated.

The chi-square test of independence is a hypothesis test. Therefore we make a falsifiable assumption about the two variables and calculate the probability of witnessing our data given our assumption. In this case, the falsifiable assumption is that the two variables are independent. If the two variables really are independent of one another, how likely are we to see the data we just saw? That’s a rough interpretation of the p-value.

Now let’s use R to generate a bunch of data on two independent categorical variables (each with two levels),  run 1000 chi-square tests on these variables, and plot a histogram of the chi-square statistic:

p <- 0.4
n1 <- 500
n2 <- 500
n <- n1 + n2
x <- rep(NA,1000)
for (i in 1:1000){
	n11 <- sum(rbinom(n1,1,p))
	n21 <- n1 - n11
	n12 <- sum(rbinom(n2,1,p))
	n22 <- n2 - n12

	table <- matrix(c(n11,n21,n12,n22),nrow=2)
	test <- chisq.test(table,correct=F)

	x[i] <- test$statistic
hist(x, freq=F)


You can see I generated two binomial samples 500 times with n=1 and p = 0.4. You can think of it as flipping a coin 500 times (or asking 500 girls if they have seen Star Wars twice). I then took the sum of the "successes" (or girls who said yes). The probabilities for both samples are the same, so they really are independent of one another. After that I calculated the chi-square test statistic using R's chisq.test function and stored it in a vector that holds a 1000 values.  Finally R made us a nice histogram. As you can see, when you run the test 1000 times, you most likely will get a test statistic value below 4.

Now let's plot a chi-square distribution with one degree freedom on top of the histogram and show that the chi-square test statistic does indeed follow a chi-square distribution approximately. Note that I put approximately in italics. I really wanted you to see that. Oh, and why one degree of freedom? Because given the row and column totals in your 2 x 2 table where you store your results, you only need one cell to figure out the other three. For example, if I told you I had 1000 data points in my Star Wars study (500 men and 500 women), and that I had 650 "yes" responses to my question and 350 "no" responses, I would only need to tell you, say, that 21 women answered "yes" in order for you to figure out the other three values (men-yes, men-no, women-no). In other words, only one out of my four table cells is free to vary. Hence the one degree of freedom. Now about that histogram.

h <- hist(x, plot=F)
ylim <- range(0, h$density, 0.75)
hist(x, freq=F, ylim=ylim)
curve(dchisq(x,df=1), add = TRUE)

When I said it was approximate, I wasn't kidding was I? Below 1 the fit isn't very good. Above 2 it gets much better. Overall though it's not bad. It's approximate, and that's often good enough. Hopefully this helps you visually see where this test statistic gets its name. Often this test gets taught in a class or presented in a textbook with no visual aids whatsoever. The emphasis is on the tables and the math. But here I wanted to crawl under the hood and show you how the chi-square distribution comes into play.

Fume Cupboards

I was reading one of my favorite statistics books, Principles of Statistics, and came across a fun problem I decided to tackle. It involves fume cupboards, big hoods chemists sometimes work under to suck up noxious fumes. The problem isn’t that complicated, but it tripped me up and I thought I should write it about it for posterity so I don’t let it get me again. Here’s the problem (slightly condensed to save typing):

In an extensive survey of chemical research workers, it was found that on average each worker required no fume cupboard 60% of the time, one fume cupboard 30% of the time, and two fume cupboards 10% of the time. Three or more were never required. If a group of four chemists work independently of one another, what’s the minimum number of fume cupboards required to provide adequate facilities at least 95% of the time?

The plan of attack came to me right away. We need to determine the probability the four chemists will – all together – need 0 fume cupboards, 1 fume cupboard, 2 fume cupboards, all the way up to 8 fume cupboards, and find the minimum number that exceeds 95%. The tricky part was counting all the different ways the four chemists can use fume cupboards.

First let’s determine how many combinations of fume-cupboard-needs can arise. One way to contemplate this is to think of a combination lock with four dials, kind of like an old bicycle lock. In this case the dials would have values of 0, 1, and 2. So let’s say chemists 1 and 2 needed one fume cupboard, and chemists 3 and 4 needed none, then our combination would look like this:

1 1 0 0

That sums to 2 and means we would need two fume cupboards to keep the chemists happy, well-ventilated and working. How many possible combinations can we have? The quick answer is 3^{4}=81. That’s three possible states of need for each chemist, all multiplied together.

Next let’s consider the minimum and maximum number of fume cupboards these chemists could possibly need. It’s possible they could all be at their desk and need 0 fume cupboards. On the other hand, they could all simultaneously need two fume cupboards for their latest experiment and thus require a total of eight. So at any given time when all four chemists are at work, they could need anywhere from 0 to 8 fume cupboards.

The probability they all need 2 fume cupboards at the same time is 0.1^{4}=0.0001. So if their company provided just 7 cupboards, they would have adequate facilities 99.99% of the time. However, the problem asks us to find the minimum number such that they have what they need 95% of the time. So we know the answer is not 8 or 7.

What’s the probability they need 7 fume cupboards? Well the first, second and third chemists could need 2, and the fourth one. The probability of that happening is 0.1^{3}\times 0.3 . But the first chemist could need one and the chemists 2-4 could need two. Or the 2nd chemist could need one and chemists 1, 3, and 4 need two. Or the 3rd chemist could need one and chemists 1, 2, and 4 need two. That’s four different ways the chemists could need 7 fume cupboards. So our probability becomes 0.1^{3}\times 0.3 \times 4 = 0.0012.

Since the events “need 7 fume cupboards” and “need 8 fume cupboards” are mutually exclusive, we can add them to obtain the probability the chemists would need 7 or more fume cupboards. That comes to 0.0013. That means if we had 6 fume cupboards (let’s call them FCs, I’m tired of typing fume cupboard), we would have adequate coverage 99.87% of the time.

Now we’re hitting the tricky part of this problem: counting the various combinations of FCs. The extremes aren’t so bad. There’s one way to need 0 and 8 FCs. There are four ways to need 1 and 7 FCs. But when we get to 2-6, we have to be careful how we count. If we undercount our probability is too low. Overcount and we go too high. One way to check ourselves is to see if we have come up with 81 possible scenarios. There are 10 possibilities accounted for when we consider 0, 1 , 7 , and 8 FCs. That means there are 71 ways for the four chemists to need 2 – 6 FCs.

What stymied me was counting the possible ways the chemists could need 3, 4 and 5 FCs. It would help if I could draw a picture in real time and explain it. But I can’t, so I’ll just show you the answer:

P(0 FCs) =0.6^{4} = 0.1296
P(1 FCs) =0.3*(0.6^{3})*4 = 0.2592
P(2 FCs) =(0.3^{2})*(0.6^{2})*6+0.1*(0.6^{3})*4 = 0.2808
P(3 FCs) =0.1*0.3*(0.6^{2})*12+(0.3^{3})*0.6*4 = 0.1944
P(4 FCs) =(0.1^{2})*(0.6^{2})*6+(0.3^{4})+0.1*(0.3^{2})*0.6*12 = 0.0945
P(5 FCs) =(0.1^{2})*0.3*0.6*12+0.1*(0.3^{3})*4 = 0.0324
P(6 FCs) =(0.1^{3})*0.6*4+(0.1^{2})*(0.3^{2})*6 = 0.0078
P(7 FCs) =(0.1^{3})*0.3*4 = 0.0012
P(8 FCs) =0.1^{4} = 0.0001

Notice the probabilities sum to 1:
0.1296 + 0.2592 + 0.2808 + 0.1944 + 0.0945 + 0.0324 + 0.0078 + 0.0012 + 0.0001 = 1

Also notice the multiples of the various combinations sum to 81:
Number of ways to require 0 or 8 FCs = 2
Number of ways to require 1 and 7 FCs = 8
Number of ways to require 2 and 6 FCs = 20
Number of ways to require 3 and 5 FCs = 32
Number of ways to require 4 FCs = 19

All that’s left now is to find the minimum number of FCs that’s at least 95%. It turns out to be 4:

P(4 or fewer FCs needed) = 0.1296 + 0.2592 + 0.2808 + 0.1944 + 0.0945 = 0.9585

Since a good industrial FC costs well over $10,000, one can see how a problem such as this has use in real life (provided our initial estimates of FC use is reasonably accurate).

How many rolls of a die to see all sides at least once?

Say you have a fair die. (Most of them are these days.) You roll it once and observe a 3. You roll it again and observe a 1. You roll it yet again and get a 4. In three throws you’ve observed 3 of the possible sides. What if you keep going? How many rolls would it take to see all six sides at least once. Obviously the minimum number is 6. But the probability of that happening is about 0.015. (1 \times \frac{5}{6}\times \frac{4}{6}\times \frac{3}{6}\times \frac{2}{6}\times \frac{1}{6}). What is the most likely number of rolls? Or put in the language of a statistics textbook, what’s the expected number of rolls?

To solve this problem we need to use the Geometric Distribution. If a series of trials are independent of one another, and each trial has the same probability of success, then the number of trials needed to observe the first success has a geometric distribution. This is the scenario of our roll of the die. Each roll is independent of one another and the probability of success remains the same for each roll, whatever we define “success” to be. The mean of a geometric distribution is \frac{1}{p}. That means the expected number of times we need to roll a dice to observe, say, a four is 6.

Back to our problem. We’re certain to get at least one of the faces on the first roll. So that probability is 1. On the second roll we have a probability of \frac{5}{6} of getting a different face than what we got on the first roll. It’s not certain. There is a probability of \frac{1}{6} of getting the same result as the first roll. Once we have observed two sides, the probability of observing a third side becomes \frac{4}{6}. On so on. Notice that as we observe previously unseen sides the definition of success changes, therefore the probability changes. So to determine the expected number of rolls to see all sides of a die at least once, we need to find the expected number of rolls for each definition of “success” and sum them. The steps below describe the process. (The numbers represent the order of the observed sides, not the values on the dice.)

  1. We’re guaranteed to see one of the sides on the first roll. Expected number of rolls: 1
  2. We have \frac{5}{6} probability of seeing a different side than what was previously observed. Expected number of rolls: 1/\frac{5}{6}=1.2
  3. We have \frac{4}{6} probability of seeing a different side than what was previously observed in steps 1 and 2. Expected number of rolls: 1/\frac{4}{6}=1.5
  4. We have \frac{3}{6} probability of seeing a different side than what was previously observed in steps 1 -3. Expected number of rolls: 1/\frac{3}{6}=2
  5. We have \frac{2}{6} probability of seeing a different side than what was previously observed in steps 1-4. Expected number of rolls: 1/\frac{2}{6}=3
  6. We have \frac{1}{6} probability of seeing a different side than what was previously observed in steps 1-5. Expected number of rolls: 1/\frac{1}{6}=6

Adding all the expected number of rolls for each definition of success we get 14.7. So we expect to roll a die about 15 times on average  before we observe all sides at least once.

We can use this same line of thinking to answer such “real-life” questions as how many boxes of cereal would you have to buy to collect the latest series of cheap plastic toys, assuming the toys are randomly added to the cereal in a uniform manner. Of course if you’re desperate to collect the precious toys and have a modicum of patience, the answer is probably 0. You’d just wait to buy the whole set on eBay.