Category Archives: Probability

Revisiting some old R-code

One of my first blog posts on this site simulated rolling a die 6 times and observing if side i was observed on the ith roll at least once. (For example, rolling a 3 on my 3rd roll would be a success.) I got the idea from one of my statistics textbooks which had a nice picture of the simulation converging to the true probability of 0.665 over the course of 500 trials. I was able to recreate the simulation in R and I guess I got excited and blogged about it. I was reminded of this today when I opened that old textbook and came across the R code that I had actually written in the margin by hand. Apparently I was super proud of my ground-breaking R code and decided to preserve it for posterity in writing. smh

Over 5 years later my precious R code looks pretty juvenile. It's way more complicated than it needs to be and doesn't take advantage of R's vectorized calculations. As Venables and Ripley say in MASS, “Users coming to S from other languages are often slow to take advantage of the power of S to do vectorized calculations…This often leads to unnecessary loops.” Indeed. I'm living proof of that.

I shouldn't be too hard on myself though. I was not yet familiar with functions like replicate and cumsum. And I was more focused on recreating the plot than writing optimal R code. I went with what I knew. And R, so forgiving and flexible, accommodated my novice enthusiasm.

Here is how I would approach the problem today:

r.out <- replicate(n = 500, any(sample(1:6, size = 6, replace = T) == 1:6))
p.out <- cumsum(r.out)/seq(500)
plot(x = seq(500), y = p.out, type = "l", ylim = c(0,1), 
     main = "Convergence to probability as n increases", xlab = "n")
abline(h = 0.665)

On line 1, we use the sample function to “roll a die 6 times” by sampling the numbers 1 – 6, with replacement, 6 times. Then we compare the 6 results with the vector of numbers 1 – 6 using the == operator and use the any function to check if any are TRUE. Next we replicate that 500 times and store the result in r.out. This is a vector of TRUE/FALSE values which R treats numerically as 1 and 0. This means we can use cumsum to find the cumulative sum of successes. To determine the cumulative proportion of successes, we divide each cumulative sum by the trial number. The result is a vector of proportions that should start converging to 0.665. Finally we plot using base R plot and abline.

This is more efficient than my original attempt 5 years ago and better captures the spirit of the simulation. I'm sure 5 years from now if I stumble upon this post I'll have yet another more elegant way to do it. I'm already looking at it thinking, “I should have generalized this with a function, and used ggplot2 to make the graph. And I shouldn't do seq(500) twice.” In fact I know I could have avoided the replicate function by using the fact that there's a probability of \(\frac{1}{6}\) of observing side i on the ith roll of a die. So I could have used a single rbinom call to do the simulation, like so:

r.out2 <- cumsum(rbinom(n = 500, size = 6, prob = 1/6) > 0)
p.out2 <- r.out2/seq(500)
plot(x = seq(500), y = p.out2, type = "l", ylim = c(0,1),
     main = "Convergence to probability as n increases", xlab = "n")
abline(h = 0.665)

In this version instead of simulating 6 literal die rolls, we simulate the number of successes in 6 die rolls. We turn each roll of the die into a binomial event: success or failure. The rbinom function allows us to simulate binomial events where size is the number of trials (or rolls in this case) and prob is the probability of success at each trial. So rbinom(n = 1, size = 6, prob = 1/6) would return a number ranging from 0 to 6 indicating the number of successes. Think of it as flipping 6 coins, each with probability of getting heads as \(\frac{1}{6}\), and then counting the number of heads we observed. Setting the n argument to 500 replicates it 500 times. After that it's simply a matter of logically checking which outcomes were greater than 0 and using cumsum on the resulting TRUE/FALSE vector.
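
As an aside, the 0.665 reference line can be recovered exactly from the same binomial setup: the chance of at least one match is one minus the chance of zero matches in 6 rolls. A minimal check using only base R (the variable names are mine):

```r
# P(at least one match) = 1 - P(zero matches in 6 rolls)
p_exact <- 1 - dbinom(0, size = 6, prob = 1/6)

# the same value written out by hand: each roll misses with probability 5/6
p_by_hand <- 1 - (5/6)^6

round(c(p_exact, p_by_hand), 3)
```

Both round to 0.665, the value drawn with abline.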

This version is way faster. I mean way faster. Compare the time each version takes to run 1,000,000 trials:

system.time({
r.out <- replicate(n = 1e6, any(sample(1:6, size = 6, replace = T) == 1:6))
p.out <- cumsum(r.out)/seq(1e6)
})
##    user  system elapsed 
##    5.26    0.00    5.26
system.time({
r.out2 <- cumsum(rbinom(n = 1e6, size = 6, prob = (1/6)) > 0)
p.out2 <- r.out2/seq(1e6)
})
##    user  system elapsed 
##    0.06    0.00    0.06

It's not even close. Who was the dummy that wrote that first version with replicate?

But does the new faster version reflect the experimental setting better? Not really. Remember, we're demonstrating probability concepts with die rolls in the first chapter of an intro stats textbook. That's probably not the best time to break out rbinom. And the demo was for 500 trials, not 1,000,000. I had to ramp up the trials to see the speed difference. Maybe the “right” R code in this situation is not the fastest version but rather the one that's easier to understand.
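
For what it's worth, here is a rough sketch of the generalization I mentioned earlier: a function that wraps the replicate version and builds the trial index only once. The function name sim_match and its arguments are made up for illustration, and the seed is arbitrary.

```r
# Hypothetical helper: running proportion of trials in which side i
# shows up on the ith roll at least once
sim_match <- function(trials = 500, sides = 6) {
  idx <- seq_len(sides)
  hits <- replicate(trials,
                    any(sample(idx, size = sides, replace = TRUE) == idx))
  cumsum(hits) / seq_len(trials)
}

set.seed(1)  # arbitrary seed so the run is repeatable
p.out <- sim_match(trials = 500)
plot(p.out, type = "l", ylim = c(0, 1), xlab = "n",
     main = "Convergence to probability as n increases")
abline(h = 1 - (5/6)^6)
```

It keeps the readable "roll the die" logic while removing the repeated seq(500).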

Profile likelihood ratio confidence intervals

When you fit a generalized linear model (GLM) in R and call confint on the model object, you get confidence intervals for the model coefficients. But you also get an interesting message:

Waiting for profiling to be done...

What's that all about? What exactly is being profiled? Put simply, it's telling you that it's calculating a profile likelihood ratio confidence interval.

The typical way to calculate a 95% confidence interval is to multiply the standard error of an estimate by some normal quantile such as 1.96 and add/subtract that product to/from the estimate to get an interval. In the context of GLMs, we sometimes call that a Wald confidence interval.

Another way to determine an upper and lower bound of plausible values for a model coefficient is to find the minimum and maximum value of the set of all coefficients that satisfy the following:

\[-2\log\left(\frac{L(\beta_{0}, \beta_{1}|y_{1},\ldots,y_{n})}{L(\hat{\beta}_{0}, \hat{\beta}_{1}|y_{1},\ldots,y_{n})}\right) < \chi_{1,1-\alpha}^{2}\]

Inside the parentheses is a ratio of likelihoods. In the denominator is the likelihood of the model we fit. In the numerator is the likelihood of the same model but with different coefficients. (More on that in a moment.) We take the log of the ratio and multiply by -2. This gives us a likelihood ratio test (LRT) statistic. This statistic is typically used to test whether a coefficient is equal to some value, such as 0, with the null likelihood in the numerator (model without coefficient, that is, equal to 0) and the alternative or estimated likelihood in the denominator (model with coefficient). If the LRT statistic is less than \(\chi_{1,0.95}^{2} \approx 3.84\), we fail to reject the null. The coefficient is statistically not much different from 0. That means the likelihood ratio is close to 1. The likelihood of the model without the coefficient is almost as high as that of the model with it. On the other hand, if the ratio is small, that means the likelihood of the model without the coefficient is much smaller than the likelihood of the model with the coefficient. This leads to a larger LRT statistic since it's being log transformed, which leads to a value larger than 3.84 and thus rejection of the null.

Now in the formula above, we are seeking all such coefficients in the numerator that would make it a true statement. You might say we're “profiling” many different null values and their respective LRT statistics. Do they fit the profile of a plausible coefficient value in our model? The smallest value we can get without violating the condition becomes our lower bound, and likewise with the largest value. When we're done we'll have a range of plausible values for our model coefficient that gives us some indication of the uncertainty of our estimate.

Let's load some data and fit a binomial GLM to illustrate these concepts. The following R code comes from the help page for confint.glm. This is an example from the classic Modern Applied Statistics with S. ldose is a dosing level and sex is self-explanatory. SF is number of successes and failures, where success is number of dead worms. We're interested in learning about the effects of dosing level and sex on number of worms killed. Presumably this worm is a pest of some sort.

# example from Venables and Ripley (2002, pp. 190-2.)
ldose <- rep(0:5, 2)
numdead <- c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16)
sex <- factor(rep(c("M", "F"), c(6, 6)))
SF <- cbind(numdead, numalive = 20-numdead)
budworm.lg <- glm(SF ~ sex + ldose, family = binomial)
summary(budworm.lg)
## 
## Call:
## glm(formula = SF ~ sex + ldose, family = binomial)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.10540  -0.65343  -0.02225   0.48471   1.42944  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.4732     0.4685  -7.413 1.23e-13 ***
## sexM          1.1007     0.3558   3.093  0.00198 ** 
## ldose         1.0642     0.1311   8.119 4.70e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 124.8756  on 11  degrees of freedom
## Residual deviance:   6.7571  on  9  degrees of freedom
## AIC: 42.867
## 
## Number of Fisher Scoring iterations: 4

The coefficient for ldose looks significant. Let's determine a confidence interval for the coefficient using the confint function. We call confint on our model object, budworm.lg and use the parm argument to specify that we only want to do it for ldose:

confint(budworm.lg, parm = "ldose")
## Waiting for profiling to be done...
##     2.5 %    97.5 % 
## 0.8228708 1.3390581

We get our “waiting” message though there really was no wait. If we fit a larger model and request multiple confidence intervals, then there might actually be a waiting period of a few seconds. The lower bound is about 0.82 and the upper bound about 1.34. We might say every one-unit increase in dosing level increases the log odds of killing worms by at least 0.8. We could also exponentiate to get a CI for an odds ratio estimate:

exp(confint(budworm.lg, parm = "ldose"))
## Waiting for profiling to be done...
##    2.5 %   97.5 % 
## 2.277027 3.815448

The odds of “success” (killing worms) are at least 2.3 times higher at one dosing level versus the next lower dosing level.
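
For comparison, here is a sketch of the Wald interval described at the top of this post, computed for the same ldose coefficient. The data lines repeat the budworm example so the chunk runs on its own; the object names est, se and wald are mine.

```r
# budworm data, as in the example above
ldose <- rep(0:5, 2)
numdead <- c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16)
sex <- factor(rep(c("M", "F"), c(6, 6)))
SF <- cbind(numdead, numalive = 20 - numdead)
budworm.lg <- glm(SF ~ sex + ldose, family = binomial)

# Wald interval: estimate +/- normal quantile * standard error
est <- coef(summary(budworm.lg))["ldose", "Estimate"]
se  <- coef(summary(budworm.lg))["ldose", "Std. Error"]
wald <- est + c(-1, 1) * qnorm(0.975) * se
wald

# confint.default() computes the same Wald interval
confint.default(budworm.lg, parm = "ldose")
```

With the estimate and standard error from the summary (1.0642 and 0.1311), this works out to roughly 0.81 to 1.32, a little lower than the profile likelihood interval at both ends.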

To better understand the profile likelihood ratio confidence interval, let's do it “manually”. Recall the denominator in the formula above was the likelihood of our fitted model. We can extract that with the logLik function:

den <- logLik(budworm.lg)
den
## 'log Lik.' -18.43373 (df=3)

The numerator was the likelihood of a model with a different coefficient. Here's the likelihood of a model with a coefficient of 1.05:

num <- logLik(glm(SF ~ sex + offset(1.05*ldose), family = binomial))
num
## 'log Lik.' -18.43965 (df=2)

Notice we used the offset function. That allows us to fix the coefficient to 1.05 and not have it estimated.
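
One way to convince yourself the offset trick does what we want: fix the ldose coefficient at its fitted value and the log likelihood should match that of the full model. A small sketch (data repeated so it runs on its own; b_hat and fixed are names I made up):

```r
ldose <- rep(0:5, 2)
numdead <- c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16)
sex <- factor(rep(c("M", "F"), c(6, 6)))
SF <- cbind(numdead, numalive = 20 - numdead)
budworm.lg <- glm(SF ~ sex + ldose, family = binomial)

# fix the ldose coefficient at its estimate instead of re-estimating it
b_hat <- unname(coef(budworm.lg)["ldose"])
fixed <- glm(SF ~ sex + offset(b_hat * ldose), family = binomial)

# same log likelihood, but one fewer estimated parameter
c(full = as.numeric(logLik(budworm.lg)), fixed = as.numeric(logLik(fixed)))
attr(logLik(fixed), "df")
```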

Since we already extracted the log likelihoods, we need to subtract them. Remember this rule from algebra?

\[\log\frac{M}{N} = \log M - \log N\]

So we subtract the denominator from the numerator, multiply by -2, and check if it's less than 3.84, which we calculate with qchisq(p = 0.95, df = 1)

-2*(num - den)
## 'log Lik.' 0.01184421 (df=2)
-2*(num - den) < qchisq(p = 0.95, df = 1)
## [1] TRUE

It is. 1.05 seems like a plausible value for the ldose coefficient. That makes sense since the estimated value was 1.0642. Let's try it with a larger value, like 1.5:

num <- logLik(glm(SF ~ sex + offset(1.5*ldose), family = binomial))
-2*(num - den) < qchisq(p = 0.95, df = 1)
## [1] FALSE

FALSE. 1.5 seems too big to be a plausible value for the ldose coefficient.

Now that we have the general idea, we can program a while loop to check different values until we exceed our threshold of 3.84.

cf <- budworm.lg$coefficients[3]  # fitted coefficient 1.0642
cut <- qchisq(p = 0.95, df = 1)   # about 3.84
e <- 0.001                        # increment to add to coefficient
LR <- 0                           # to kick start our while loop 
while(LR < cut){
  cf <- cf + e
  num <- logLik(glm(SF ~ sex + offset(cf*ldose), family = binomial))
  LR <- -2*(num - den)
}
(upper <- cf)
##    ldose 
## 1.339214

To begin we save the original coefficient to cf, store the cutoff value to cut, define our increment of 0.001 as e, and set LR to an initial value of 0. In the loop we increment our coefficient estimate which is used in the offset function in the estimation step. There we extract the log likelihood and then calculate LR. If LR is less than cut (3.84), the loop starts again with a new coefficient that is 0.001 higher. We see that our upper bound of 1.339214 is very close to what we got above using confint (1.3390581). If we set e to smaller values we'll get closer.

We can find the LR profile lower bound in a similar way. Instead of adding the increment we subtract it:

cf <- budworm.lg$coefficients[3]  # reset cf
LR <- 0                           # reset LR 
while(LR < cut){
  cf <- cf - e
  num <- logLik(glm(SF ~ sex + offset(cf*ldose), family = binomial))
  LR <- -2*(num - den)
}
(lower <- cf)
##    ldose 
## 0.822214

The result, 0.822214, is very close to the lower bound we got from confint (0.8228708).

This is a very basic implementation of calculating a likelihood ratio confidence interval. It is only meant to give a general sense of what's happening when you see that message Waiting for profiling to be done.... I hope you found it helpful. To see how R does it, enter getAnywhere(profile.glm) in the console and inspect the code. It's not for the faint of heart.
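
If you'd rather not step through values 0.001 at a time, a root finder can locate the two bounds directly. The sketch below is my own variation, not how confint does it: it uses uniroot to find where the LRT statistic crosses the cutoff on each side of the estimate. The helper lr_gap and the search width of 1 on either side of the estimate are assumptions chosen for this example.

```r
ldose <- rep(0:5, 2)
numdead <- c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16)
sex <- factor(rep(c("M", "F"), c(6, 6)))
SF <- cbind(numdead, numalive = 20 - numdead)
budworm.lg <- glm(SF ~ sex + ldose, family = binomial)

b_hat <- unname(coef(budworm.lg)["ldose"])
den <- as.numeric(logLik(budworm.lg))
cut <- qchisq(p = 0.95, df = 1)

# LRT statistic minus the cutoff: negative inside the interval, positive outside
lr_gap <- function(b) {
  num <- as.numeric(logLik(glm(SF ~ sex + offset(b * ldose), family = binomial)))
  -2 * (num - den) - cut
}

# one sign change on each side of the estimate
lower <- uniroot(lr_gap, interval = c(b_hat - 1, b_hat), tol = 1e-6)$root
upper <- uniroot(lr_gap, interval = c(b_hat, b_hat + 1), tol = 1e-6)$root
c(lower = lower, upper = upper)
```

The bounds should land close to the ones from the while loops, without having to choose an increment.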

I have to mention the book Analysis of Categorical Data with R, from which I gained a better understanding of the material in this post. The authors have kindly shared their R code at the following web site if you want to have a look: http://www.chrisbilder.com/categorical/

To see how they “manually” calculate likelihood ratio confidence intervals, go to the following R script and see the section “Examples of how to find profile likelihood ratio intervals without confint()”: http://www.chrisbilder.com/categorical/Chapter2/Placekick.R

Earthquake data and Benford’s Law

Much has been written about Benford's Law, that weird phenomenon where if you have a naturally occurring set of numerical data, 30% of the numbers will begin with 1, 18% will begin with 2, 12% will begin with 3, and so on. You might expect the distribution of leading digits to be uniformly distributed, but no, that just isn't the case. 60% of the time the leading digit is 1, 2, or 3. Though it may seem like some curious mathematical anomaly, Benford's Law has been used to detect fraud in elections and accounting.

In this post let's use R to verify that earthquake data obtained from the USGS does indeed follow Benford's Law. On May 28, 2016, I downloaded data for all earthquakes in the past 30 days. The data has a number of fields, including earthquake magnitude, depth of the earthquake, location, time and many others. Benford's Law says those numbers should follow a particular distribution.

The formula for Benford's Law is as follows:

\[P(d) = \log_{10} \left(1+\frac{1}{d}\right) \]

That says the probability that a digit d occurs as the first digit is equal to the log base 10 of 1 + 1/d. We can quickly generate these for d = 1 – 9 in R:

log10(1 + (1/(1:9)))
## [1] 0.30103000 0.17609126 0.12493874 0.09691001 0.07918125 0.06694679
## [7] 0.05799195 0.05115252 0.04575749
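
A quick sanity check on those numbers (ben is just a name I'm using here): the nine probabilities should sum to 1, and the first three should account for the 60% mentioned earlier.

```r
ben <- log10(1 + 1/(1:9))
sum(ben)       # the nine probabilities sum to 1
sum(ben[1:3])  # share of leading digits that are 1, 2 or 3, about 0.602
```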

And we can make a quick plot as follows:

barplot(log10(1 + (1/(1:9))), names.arg = 1:9, main = "Benford's Law")

So according to this law, if we look at the distribution of first digits in our earthquake data, we should see them closely follow this distribution. Let's find out!

First we need to import the data. Thanks to the USGS, this data comes ready for analysis. All we need to do is read it into R:

dat <- read.csv("all_month.csv")
nrow(dat)
## [1] 8584

Over 8500 earthquakes in the past 30 days! A quick look at the magnitude shows us most of them are very small:

summary(dat$mag)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -1.900   0.790   1.280   1.526   1.930   7.200      20

This also tells us we have some negative numbers (?) as well as some missing data, NA's. We also have numbers that start with 0 and have a decimal, such as 0.79. (In this case, Benford would say the number “starts” with a 7, not a 0.) This means when we determine the leading digits, we'll need to ignore negative signs, leading 0s, decimals and missing values.

Let's investigate the mag column. First we remove the NA's. Below is.na(dat$mag) generates a logical vector of TRUE and FALSE values. Adding an ! in front reverses the TRUE and FALSE values. Finally inserting the result in subsetting brackets returns only those values that are TRUE, i.e., not missing.

digits <- dat$mag[!is.na(dat$mag)]

Next we extract the first digits. To help with this I use two R packages, magrittr and stringr. The magrittr package allows us to “chain” commands together with the %>% operator. The stringr package provides the str_extract function that allows us to extract phrases that follow a certain pattern. So what we have below can be translated as follows:

  • take the absolute value of digits (to get rid of negative signs)
  • convert digits to character (so we can use the next two functions)
  • extract anything that is not a “0” or a “.” (We express this as a regular expression: "[^0\\.]")
  • extract the first digit (starting and ending at position 1)

library(magrittr)
library(stringr)
digits <- abs(digits) %>% 
  as.character() %>% 
  str_extract(pattern = "[^0\\.]") %>% 
  substr(1,1)

As an extra precaution, we then convert digits to a factor and set levels to be all digits 1 through 9. This ensures all digits are represented in subsequent calculations.

digits <- factor(digits, levels = 1:9) # ensure all digits represented

And finally we tally the first digits and calculate proportions:

table(digits) %>% prop.table()
## digits
##          1          2          3          4          5          6 
## 0.42800141 0.17134139 0.05738763 0.09447248 0.05809177 0.03990142 
##          7          8          9 
## 0.04459570 0.05304542 0.05316277

We see 1 appearing a lot more often, but it's hard to tell how the other digits compare to Benford's Law. Let's put both distributions on the same graph. To do this, I went ahead and created a function that will allow us to check any vector of numbers against Benford's Law. Let's load the function and see how it works, then we'll break it down and explain it.

library(ggplot2)
compareBenford <- function(d){
  digits <- d[!is.na(d)]
  digits <- substr(stringr::str_extract(as.character(abs(digits)), pattern = "[^0\\.]"),1,1)
  digits <- factor(digits, levels = 1:9) # ensure all digits represented
  depth <- prop.table(table(digits))
  ben <- log10(1 + (1/(1:9)))
  dat2 <- data.frame(ben, depth)
  names(dat2) <- c("Benford","Digit",deparse(substitute(d)))
  dat2L <- reshape2::melt(dat2,id.vars="Digit", variable.name = "Type", value.name = "Frequency")
  ggplot(dat2L, aes(x=Digit, y=Frequency, fill=Type)) + 
    geom_bar(stat = "identity", position = "dodge")
}

compareBenford(dat$mag)

We see dat$mag has more 1's than we might expect and fewer 3's, but otherwise seems to follow the distribution pretty closely. Let's check out earthquake depth.

compareBenford(dat$depth)

This appears to fit even better.

About the function:

  • The first four lines are what we did before. Notice I went ahead and nested the functions instead of using magrittr. That's how I originally coded it before deciding to write a blog post. Then I decided to break out magrittr for the fun of it.
  • After that I calculate Benford's proportions.
  • Then I put both sets of proportions in a data frame.
  • Then I change the names of the data frame. Notice there are three names to change, not two. Why? The depth vector is actually a table. When it gets pulled into a data frame, two columns are produced: the table cells and the table names. The names are the digits 1 – 9. Notice I also use deparse(substitute(d)) to name the first-digit proportions in the data frame. This ensures that whatever vector I give my function, its column is named after the expression I passed in. So if I give it dat$mag, the column name of the first-digit proportions will be dat$mag.
  • Next I reshape the data using the melt function in the reshape2 package. This puts all the proportions in one column called Frequency and the type of proportion in another column called Type. I still have my Digit column that indicates which proportion goes with which digit. Having my data in this form allows me to use ggplot to map fill color to Type.
  • Finally I create the graph. I set Digit on the x-axis, Frequency on the y-axis, and map Type to the fill color. The geom_bar function says draw a bar plot. The stat = "identity" and position = "dodge" arguments say set the height of the bars to the actual values and “dodge” them so they're next to one another.
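
The "three names" point and the deparse(substitute(d)) trick are easier to see on a toy example. This is only an illustration: the six made-up digits and the helper show_name are not part of the function above.

```r
# a handful of made-up first digits, with all nine levels kept
digits <- factor(c(1, 1, 2, 3, 1, 9), levels = 1:9)
depth <- prop.table(table(digits))
ben <- log10(1 + 1/(1:9))

# the table contributes two columns (its names and its cells),
# so the data frame has three columns before renaming
dat2 <- data.frame(ben, depth)
names(dat2)

# deparse(substitute()) recovers the expression passed to a function
show_name <- function(d) deparse(substitute(d))
show_name(iris$Sepal.Length)
```

The first call shows the default names ben, digits and Freq that the function then overwrites; the second returns the string "iris$Sepal.Length".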

Let's look at ALL numeric columns in the earthquakes data. We need to identify all numeric columns and pull them into a vector for our function to work. Here's how I do it:

theCols <- sapply(dat, class) %in% c("numeric","integer")
numCols <- names(dat)[theCols]
allDigits <- unlist(dat[,numCols])

compareBenford(allDigits)

Not a bad fit. Benford's Law seems to work just fine for the earthquake data.

But what about data where Benford's Law doesn't work? Recall that Benford's Law applies to most naturally occurring sets of numbers. Not all. Take the iris data that come with R. These are measurements of 50 flowers from each of 3 species of iris. This is a small data set looking at very specific measurements. It doesn't seem to follow Benford's Law very well.

irisDigits <- unlist(iris[,1:4])
compareBenford(irisDigits)
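
If you want something more formal than eyeballing bars, one option is a chi-square goodness-of-fit test of the first-digit counts against Benford's proportions. This is a sketch of my own, not part of compareBenford: the helper first_digit is hypothetical and pulls the leading digit out of scientific notation with base R instead of stringr (it drops zeros and NAs).

```r
# hypothetical helper: first significant digit via scientific notation
first_digit <- function(x) {
  x <- abs(x[!is.na(x) & x != 0])
  as.integer(substr(sprintf("%e", x), 1, 1))
}

irisDigits <- unlist(iris[, 1:4])
obs <- table(factor(first_digit(irisDigits), levels = 1:9))
ben <- log10(1 + 1/(1:9))

# goodness-of-fit test of the observed counts against Benford's Law
gof <- chisq.test(obs, p = ben)
gof
```

For the iris measurements the p-value is tiny, which agrees with the plot: no leading 8s or 9s at all and far too many digits in the middle of the range.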

Permutation Tests

Let's talk about permutation tests and why we might want to do them.

First think about the two-sample t-test. The null hypothesis of this test is that both samples come from the same distribution. If we assume both samples come from the same approximately normal distribution, we can use math formulas based on probability theory to calculate a test statistic. We can then calculate the probability of getting such a test statistic (or one greater) under the assumption of both samples coming from the same distribution. For example, let's draw two samples from the same normal distribution and run a t-test.

set.seed(135)
s1 <- round(rnorm(10, 100, 16))
s2 <- round(rnorm(10, 100, 16))
t.test(s1, s2, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  s1 and s2
## t = 0.8193, df = 18, p-value = 0.4233
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -7.508 17.108
## sample estimates:
## mean of x mean of y 
##     102.4      97.6

As expected the p-value is high, telling us the difference in their means is not significant. Another interpretation is that the resulting t-test statistic (0.8193) would be likely if both samples came from the same distribution.
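
The "math formulas" behind that output are short enough to write out. Here is a sketch of the equal-variance t statistic and its two-sided p-value computed by hand; the names sp2, t_stat and p_val are mine.

```r
set.seed(135)
s1 <- round(rnorm(10, 100, 16))
s2 <- round(rnorm(10, 100, 16))

n1 <- length(s1)
n2 <- length(s2)

# pooled variance, assuming equal variances in both groups
sp2 <- ((n1 - 1) * var(s1) + (n2 - 1) * var(s2)) / (n1 + n2 - 2)

# t statistic: difference in means over its standard error
t_stat <- (mean(s1) - mean(s2)) / sqrt(sp2 * (1/n1 + 1/n2))

# two-sided p-value from the t distribution with n1 + n2 - 2 df
p_val <- 2 * pt(-abs(t_stat), df = n1 + n2 - 2)
c(t = t_stat, p = p_val)
```

These should agree with the t and p-value reported by t.test(s1, s2, var.equal = TRUE).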

In real life we don't know with certainty the distribution of the population from which our samples are drawn. We also don't know the variance of the distribution or whether the distribution is even symmetric. We also may be stuck with a small sample size. Fortunately the t-test is pretty robust and usually reliable even when its assumptions are wrong. However, if you have your doubts, you can try a permutation test.

In the case of our two-sample example above, the permutation test takes all possible combinations of group membership and creates a permutation distribution. In other words, if we assume both samples came from the same population, a data point in group 1 is just as likely to appear in group 2. If we determine all possible permutations, we can calculate our statistic of interest for each permutation and create a distribution. We can then assess where our original statistic falls in this distribution. If it's in the tails then we have evidence that our two samples come from two different populations. Now if your sample sizes are bigger than, say, 15, creating the permutations can get computationally expensive very fast. So this is probably not something you want to do unless you're dealing with small samples.

Here's how we can do a permutation test “by hand” in R with our fake data:

d0 <- mean(s1) - mean(s2)
alls <- c(s1, s2)  # combine into one vector
N <- length(alls)
n <- length(s1)
p <- combn(N, n)  # generate all combinations of n chosen from N
dmeans <- numeric(dim(p)[2])  # vector to store means
for (i in 1:dim(p)[2]) {
    dmeans[i] <- mean(alls[p[, i]]) - mean(alls[-p[, i]])
}
# plot histogram of all possible difference in means with lines indicating
# our original difference
hist(dmeans)
abline(v = d0, lty = 2)
abline(v = -d0, lty = 2)

The combn function generates a matrix of all possible index combinations of n chosen from N. I say “index” because combn doesn't return the values themselves but rather their index position in the original vector. Thus we can use it to select all possible combinations from our combined vector, which is what we do in the for loop. In this case there are 184756 possible combinations. Once all possible differences in means are calculated we can plot a histogram of the differences and draw some lines to indicate where our original sample falls. Here I drew two lines to indicate the absolute value of the difference, the equivalent of a two-sided test. The probability (or p-value) of getting a difference more extreme than our original sample is pretty high. We can see this because there's a lot of histogram on either side of our lines. We can also calculate this value:

# two-sided p-value
signif((sum(dmeans <= -abs(d0)) + sum(dmeans >= abs(d0)))/dim(p)[2])
## [1] 0.4291

Very close to the p-value returned by the t-test.

Of course there are packages for this sort of thing. The one I know of is perm. Here's how we use its permTS function to replicate our results above:

library(perm)
permTS(s1, s2, alternative = "two.sided", method = "exact.ce", 
control = permControl(tsmethod = "abs"))
## 
##  Exact Permutation Test (complete enumeration)
## 
## data:  s1 and s2
## p-value = 0.4291
## alternative hypothesis: true mean s1 - mean s2 is  0
## sample estimates:
## mean s1 - mean s2 
##               4.8

The method= argument tells the function to do “complete enumeration”. The control= argument says how to calculate the p-value (i.e., the same way we did it “by hand”).

So that's a literal two-sample permutation test. As I mentioned, if your samples are large then this approach is not feasible as the number of permutations grows out of control. For example, two groups of size 20 results in 137,846,528,820 combinations. So we usually resort to resampling methods. This is where we repeatedly resample from our combined vector of values to get a large number of combinations. We don't generate all combinations, but enough to give us a good idea. Here's one way to do it:

dmeans <- numeric(2000)  # vector to store means
for (i in 1:2000) {
    g <- sample(N, n)
    dmeans[i] <- mean(alls[g]) - mean(alls[-g])
}
hist(dmeans)
abline(v = d0, lty = 2)
abline(v = -d0, lty = 2)

signif((sum(dmeans <= -abs(d0)) + sum(dmeans >= abs(d0)))/2000)
## [1] 0.4335

So out of 184,756 possible permutations we only calculated 2000 (assuming no repeats) and we still got a p-value pretty close to that of our literal permutation test. Not bad.
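
Two small things worth checking here: the combination counts quoted above, and whether the resampling loop can be written without the explicit for. The following is just a sketch of one alternative using choose and replicate; p_mc is a name I made up, and the resampled p-value will vary from run to run.

```r
set.seed(135)
s1 <- round(rnorm(10, 100, 16))
s2 <- round(rnorm(10, 100, 16))
d0 <- mean(s1) - mean(s2)
alls <- c(s1, s2)
N <- length(alls)
n <- length(s1)

choose(N, n)    # 184756 ways to pick 10 of the 20 values
choose(40, 20)  # 137846528820 for two groups of size 20

# same resampling idea as the for loop, written with replicate
dmeans <- replicate(2000, {
  g <- sample(N, n)
  mean(alls[g]) - mean(alls[-g])
})

# two-sided p-value: share of resampled differences at least as extreme as d0
p_mc <- mean(abs(dmeans) >= abs(d0))
p_mc
```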

We can do the same using the permTS function as follows:

permTS(s1, s2, alternative = "two.sided", 
method = "exact.mc", control = permControl(nmc = 2000, 
    setSEED = FALSE, tsmethod = "abs"))
## 
##  Exact Permutation Test Estimated by Monte Carlo
## 
## data:  s1 and s2
## p-value = 0.4213
## alternative hypothesis: true mean s1 - mean s2 is  0
## sample estimates:
## mean s1 - mean s2 
##               4.8 
## 
## p-value estimated from 2000 Monte Carlo replications
## 99 percent confidence interval on p-value:
##  0.3925 0.4498

The method “exact.mc” says use Monte Carlo simulation. The nmc= argument specifies number of replications (the default is 999). The setSEED= argument says not to set a seed. I want different random replications each time I run this line of code.

To wrap up, when does it make sense to use permutation tests? When you have something to permute! I quote from the classic text, An Introduction to the Bootstrap:

Permutation methods tend to apply to only a narrow range of problems. However when they apply, as in testing F = G in a two-sample problem, they give gratifyingly exact answers without parametric assumptions.

A Probability Problem in Heredity – Part 3

In my previous two posts I showed worked solutions to problems 2.5 and 11.7 in Bulmer’s Principles of Statistics, both of which involve the characteristics of self-fertilizing hybrid sweet peas. It turns out that problem 11.8 also involves this same topic, so why not work it as well for completeness. The problem asks us to assume that we were unable to find an explicit solution for the maximum likelihood equation in problem 11.7 and to solve it by using the following iterative method:

\( \theta_{1} = \theta_{0} + \frac{S(\theta_{0})}{I(\theta_{0})} \)

where \( S(\theta_{0}) \) is the value of \( \frac{d \log L}{d\theta}\) evaluated at \( \theta_{0}\) and \( I(\theta_{0})\) is the value of \( -E(\frac{d^{2}\log L}{d\theta^{2}})\) evaluated at \( \theta_{0}\).

So we begin with \( \theta_{0}\) and the iterative method returns \( \theta_{1}\). Now we run the iterative method again starting with \( \theta_{1}\) and get \( \theta_{2}\):

\( \theta_{2} = \theta_{1} + \frac{S(\theta_{1})}{I(\theta_{1})} \)

We repeat this process until we converge upon a value. This is called the Newton-Raphson method. Naturally this is something we would like to have the computer do for us.

First, recall our formulas from problem 11.7:

\( \frac{d \log L}{d\theta} = \frac{1528}{2 + \theta} - \frac{223}{1 - \theta} + \frac{381}{\theta} \)
\( \frac{d^{2}\log L}{d \theta^{2}} = -\frac{1528}{(2 + \theta)^{2}} - \frac{223}{(1 - \theta)^{2}} - \frac{381}{\theta^{2}} \)

Let’s write functions for those in R:

# maximum likelihood score
mls <- function(x) {
  1528/(2 + x) - 223/(1 - x) + 381/x
}
# second derivative of the log likelihood
# (the negative of the observed information)
inf <- function(x) {
  -1528/((2 + x)^2) - 223/((1 - x)^2) - 381/(x^2)
}

Now we can use those functions in another function that will run the iterative method starting at a trial value:

# Newton-Raphson; note that -inf() is the observed information,
# not the expected information I() in the formula above
nr <- function(th) {
  prev <- th
  repeat {
    new <- prev + mls(prev)/-inf(prev)
    if (abs(prev - new)/abs(new) < 0.0001)
      break
    prev <- new
  }
  new
}

This function first takes its argument and names it "prev". Then it starts a repeating loop. The first thing the loop does is calculate the new value using the iterative formula. It then checks to see if the difference between the new and previous value - divided by the new value - is less than 0.0001. If it is, the loop breaks and the "new" value is returned, which prints it to the console. If not, the loop repeats. Notice that each iteration is hopefully converging on a value. As it converges, the difference between the "prev" and "new" value will get smaller and smaller. So small that dividing the difference by the "new" value (or "prev" value for that matter) will begin to approach 0.

To run this function, we simply call it from the console. Let's start with a value of \( \theta_{0} = \frac{1}{4}\), as the problem suggests:

nr(1/4)
[1] 0.7844304

There you go! We could make the function tell us a little more by outputting the iterative values and number of iterations. Here's a super quick and dirty way to do that:

# newton-raphson using expected information matrix
nr <- function(th) {
  k <- 1   # number of iterations
  v <- c() # iterative values
  prev <- th
  repeat {
    new <- prev + mls(prev)/-inf(prev)
    v[k] <- new
    if (abs(prev - new)/abs(new) < 0.0001)
      break
    prev <- new
    k <- k + 1
  }
  print(new) # the value we converged on
  print(v)   # the iterative values
  print(k)   # number of iterations
}

Now when we run the function we get this:

nr(1/4)
[1] 0.7844304
[1] 0.5304977 0.8557780 0.8062570 0.7863259 0.7844441 0.7844304
[1] 6

We see it took 6 iterations to converge. And with that I think I've had my fill of heredity problems for a while.
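As a sanity check on that answer, here is a small sketch that finds the same root two other ways. It redefines the score function so the block runs on its own. The first check uses base R's uniroot() on the interval (0.01, 0.99), which is my choice of bracket. The second relies on my own algebra, so treat it as an assumption worth verifying: setting the score to zero and clearing the denominators gives the quadratic 2132θ² - 701θ - 762 = 0.

```r
# score function (first derivative of the log-likelihood), same as mls above
mls <- function(x) 1528/(2 + x) - 223/(1 - x) + 381/x

# 1. numerical root of the score on (0, 1) using base R's uniroot()
root <- uniroot(mls, interval = c(0.01, 0.99), tol = 1e-10)$root

# 2. positive root of 2132*theta^2 - 701*theta - 762 = 0 via the quadratic formula
closed <- (701 + sqrt(701^2 + 4 * 2132 * 762)) / (2 * 2132)

c(uniroot = root, quadratic = closed)
```

Both values should agree with the Newton-Raphson result to several decimal places.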

A Probability Problem in Heredity

Here’s a fun problem in heredity from the Dover classic Principles of Statistics by M.G. Bulmer. It’s from chapter 2, problem 2.5.

The results of Table 7 on p. 25 (see below) can be explained on the assumption that the genes for flower colour and pollen shape are on the same chromosome but that there is a probability π that one of the genes will be exchanged for the corresponding gene on the other chromosome. If we denote the genes for purple or red flowers by P and p, and the genes for long and round pollen by L and l, then the hybrids from the cross considered will all be of the genotype PL/pl, the notation indicating that the P and L genes are on one chromosome and the p and l genes on the other. When these hybrids are allowed to self-fertilise, there is a chance π that the L and l genes will interchange in one parent, giving Pl/pL; there are therefore really three mating types, PL/pl X PL/pl, Pl/pL X PL/pl and Pl/pL X Pl/pL, which occur with probabilities \( (1 - \pi)^{2} \), \( 2\pi(1 - \pi) \) and \(\pi^{2}\) respectively. Find the probabilities of the four possible phenotypes resulting from the experiment in terms of \( \theta = (1 - \pi)^{2} \).

Here’s the table the problem refers to:

               Purple-flowered   Red-flowered   Total
Long pollen               1528            117    1645
Round pollen               106            381     487
Total                     1634            498    2132

The interesting thing about that table is that it violates Mendel’s law of independent assortment. If it obeyed that law, then the probabilities of the resulting phenotypes would be:

               Purple-flowered       Red-flowered
Long pollen    \( \frac{9}{16} \)    \( \frac{3}{16} \)
Round pollen   \( \frac{3}{16} \)    \( \frac{1}{16} \)

We’d expect the purple/long pollen flower to happen about 9/16 = 56% of the time. Instead we see it occurring about 1528/2132 = 72% of the time. As the problem explains, this has to do with the way genes are carried on chromosomes. This means we can’t calculate probabilities as you normally would for a dihybrid cross. Therefore it asks us to calculate those probabilities conditional on whether or not the gene for pollen switched chromosomes. Our answer will take the form above, but instead of actual numbers, we’ll express our answer in terms of \( \theta = (1 - \pi)^{2} \). In other words, \( \theta \) is the probability that the gene for pollen did not switch chromosomes in either parent.
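To see the mismatch for all four cells at once, here is a quick sketch in R. The counts come straight from the table above; the vector names are just labels I made up.

```r
# observed counts from Table 7
obs <- c(purple_long = 1528, red_long = 117, purple_round = 106, red_round = 381)

# proportions expected under independent assortment (9:3:3:1)
mendel <- c(9, 3, 3, 1) / 16

# observed vs expected proportions, side by side
round(rbind(observed = obs / sum(obs), mendel = mendel), 3)
```

The first column reproduces the 72% versus 56% comparison.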

The hard part of this problem is setting it up. First, we have to recognize that we don’t know which chromosome has the characteristics. The characteristics of mating type PL/pl X PL/pl could both be on the PL chromosomes, in which case the result would be a PL, or a purple/long pollen flower. Or one could be PL and the other pl, which would still be PL, a purple/long pollen flower, since purple and long pollen are dominant characteristics. If both are on the pl chromosome, then the result would be pl, a red/round pollen flower. All told, for the PL/pl X PL/pl mating type we have the following possibilities:

PL/pl X PL/pl mating table (p = \( (1 - \pi)^{2}\) )
      PL   pl
PL    PL   PL
pl    PL   pl
For this mating type we get a purple/long pollen flower (PL) three out of four times and a red/round pollen flower (pl) one out of four times. We need to construct similar tables for the other two mating types. But notice the Pl/pL X PL/pl mating type can actually happen in reverse order as PL/pl X Pl/pL, so it needs two tables. Therefore we really need three more tables:

Pl/pL X PL/pl mating table (p = \( \pi(1 - \pi)\) )
      PL   pl
Pl    PL   Pl
pL    PL   pL

PL/pl X Pl/pL mating table (p = \( \pi(1 - \pi)\) )
      Pl   pL
PL    PL   PL
pl    Pl   pL

Pl/pL X Pl/pL mating table (p = \( \pi^{2}\) )
      Pl   pL
Pl    Pl   PL
pL    PL   pL

Have to give a quick shout out to http://truben.no/latex/table/ for helping me make those tables! Anyway…so we have 4 tables displaying 16 possible results:

  • 9 out of 16 times you get PL, a purple/long pollen flower
  • 3 out of 16 times you get Pl, a purple/round pollen flower
  • 3 out of 16 times you get pL, a red/long pollen flower
  • 1 out of 16 times you get pl, a red/round pollen flower

If the results were independent, we could just call those probabilities. But they’re not. That’s the whole point of the problem. We have to take into account the probability π of the exchange of genes from one chromosome to the other. Let’s reset the probabilities for the four mating types:

  1. PL/pl X PL/pl = \( (1-\pi)^{2} \)
  2. Pl/pL X PL/pl = \( \pi(1-\pi) \)
  3. PL/pl X Pl/pL = \( \pi(1-\pi) \)
  4. Pl/pL X Pl/pL = \( \pi^{2} \)

So the probability of getting pl in the first mating type is \( \frac{1}{4}(1-\pi)^{2} \). Recall the problem asks us to find these probabilities in terms of \( \theta = (1 - \pi)^{2} \), so we can express this as \( \frac{1}{4}\theta \). And there’s one of our answers since pl does not occur in any of the other mating types.

Now we just need to find the other probabilities. To make life easier, let’s go ahead and express all the probabilities in terms of \( \theta \):

  • \( (1 - \pi)^2 = \theta \)
  • \( (1 - \pi) = \theta^{1/2} \)
  • \( \pi = 1 - \theta^{1/2} \)
  • \( \pi^{2} = (1 - \theta^{1/2})^{2} \)

The hardest one to find is PL:

\( PL = \frac{3}{4}\theta + \frac{2}{4}(1 - \theta^{1/2})\theta^{1/2} + \frac{2}{4}(1 - \theta^{1/2})\theta^{1/2} + \frac{2}{4}(1 - \theta^{1/2})^{2} \)
\( PL = \frac{3}{4}\theta + \frac{2}{4}[2(1 - \theta^{1/2})\theta^{1/2} + 1 - \theta^{1/2} - \theta^{1/2} + \theta] \)
\( PL = \frac{3}{4}\theta + \frac{2}{4}[2(\theta^{1/2} - \theta) + 1 - 2\theta^{1/2} + \theta] \)
\( PL = \frac{3}{4}\theta + \frac{2}{4}[2\theta^{1/2} - 2\theta + 1 - 2\theta^{1/2} + \theta] \)
\( PL = \frac{3}{4}\theta + \frac{2}{4}(1 - \theta) \)
\( PL = \frac{3}{4}\theta + \frac{2}{4} - \frac{2}{4}\theta \)
\( PL = \frac{1}{4}\theta + \frac{2}{4} \)
\( PL = \frac{1}{4}(\theta + 2)\)

Next up is pL:

\( pL = \frac{1}{4}(1 - \theta^{1/2})\theta^{1/2} + \frac{1}{4}(1 - \theta^{1/2})\theta^{1/2} + \frac{1}{4}(1 - \theta^{1/2})^{2} \)
\( pL = \frac{1}{4}(\theta^{1/2} - \theta + \theta^{1/2} - \theta + 1 - \theta^{1/2} - \theta^{1/2} + \theta) \)
\( pL = \frac{1}{4}(-2\theta + 1 + \theta) \)
\( pL = \frac{1}{4}(1 - \theta) \)

That just leaves Pl. But if you look at the tables above, you’ll notice Pl appears in the same tables as pL the same number of times. So if we set up an equation to find its probability, we’ll get the same equation we started with when we solved for pL. That means we’ll get the same answer. So we don’t need to solve it. We already know it: \( Pl = \frac{1}{4}(1 - \theta) \)

And that finishes the problem. The probabilities of the four possible phenotypes are as follows:

               purple-flowered                  red-flowered
long pollen    \( \frac{1}{4}(\theta + 2) \)    \( \frac{1}{4}(1 - \theta) \)
round pollen   \( \frac{1}{4}(1 - \theta) \)    \( \frac{1}{4}\theta \)
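If you don't trust the algebra, here is a rough numerical check in R. It weights the cell counts of the four mating tables by their probabilities for a given π and compares the result with the θ formulas. The function name phenotype_probs and the test value π = 0.1 are arbitrary choices of mine for illustration.

```r
# phenotype probabilities built directly from the four mating tables
phenotype_probs <- function(pi_x) {
  # probability of each mating table, in the order listed in the post
  w <- c((1 - pi_x)^2, pi_x * (1 - pi_x), pi_x * (1 - pi_x), pi_x^2)
  # number of cells (out of 4) showing each phenotype in each table
  cells <- matrix(c(3, 0, 0, 1,
                    2, 1, 1, 0,
                    2, 1, 1, 0,
                    2, 1, 1, 0),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(NULL, c("PL", "Pl", "pL", "pl")))
  # weight each table (row) by its probability, then add up the columns
  colSums(w * cells) / 4
}

pi_x  <- 0.1              # arbitrary exchange probability for the check
theta <- (1 - pi_x)^2
from_tables  <- phenotype_probs(pi_x)
from_formula <- c(PL = (theta + 2) / 4, Pl = (1 - theta) / 4,
                  pL = (1 - theta) / 4, pl = theta / 4)
rbind(from_tables, from_formula)
```

The two rows should match, and each should sum to 1.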

Buffon’s Needle Problem, or How to use Probability to Estimate Pi

I gave a presentation on Buffon’s Needle Problem in a job interview once. Here’s the presentation I gave in PDF format if you’re interested. If you’ve never heard of Buffon’s Needle Problem, you should open my little presentation and browse through it. It’s one of the damnedest things I’ve ever learned.

Here’s the setup. Imagine you have a floor with parallel lines, kind of like hardwood floors, with all lines equally spaced from one another. Let’s say the distance between the lines is L:
[Figure buf_01: a floor with parallel lines spaced a distance L apart]

Now imagine you drop some needles of length L on the floor and count the instances of where the needles cross the lines:
[Figure buf_02: red needles of length L dropped on the lined floor]

Yes, I know, I have red needles! Cool, right? Anyway, here’s the crazy thing: if you drop a lot of needles (say 10,000) and count the number of needles crossing lines, you can use that information to estimate Pi! It turns out that if the distance between lines and the needle length are both 1, then \( \pi \approx \frac{2n}{k} \), where n = number of needles and k = number of needles crossing lines. I don’t know about you but I think that’s nuts!

Let’s fire up R and see this in action. Here’s slightly modified code from the presentation:

a <- 1 # length of needle
L <- 1 # distance between lines
n <- 100000 # number of dropped needles
hit <- 0
for (i in 1:n) {
  x <- runif(1, 0, 1)
  y <- runif(1, 0, 1)
  while (x^2 + y^2 > 1) { # no points outside of unit circle
    x <- runif(1, 0, 1)
    y <- runif(1, 0, 1)
  }
  theta <- atan(y/x) # the random angle
  d <- runif(1, 0, (L/2)) # distance of needle midpoint to nearest line
  if (d <= (a/2)*sin(theta)) {
    hit <- hit + 1
  }
}
pi.est <- (2*n*a)/(hit*L) # reduces to 2n/k when a = L = 1
pi.est

First I define the distance between the lines (L) and the needle length (a) to both be 1. They don't have to be equal, but the needle length does need to be less than or equal to the distance between the lines. It turns out that in general \( \pi \approx \frac{2na}{kL} \). In my example, I have a = L = 1, so it simplifies to \( \pi \approx \frac{2n}{k} \). Next I define a variable called "hit" to count the number of needles crossing a line and then I dive into a loop to simulate dropping 100,000 needles.

The first 7 lines in the loop generate a random acute angle (less than 90 degrees or \( \frac{\pi}{2} \) radians) by way of the arctan function and x and y points that lie within the unit circle. The reason the points need to lie within the unit circle is to ensure all angles have an equal chance of being selected. The next line randomly generates a distance (d) from the midpoint of the needle to the nearest line. Using my random angle and random distance I then check to see if my needle crossed the line. If \( d \le \frac{a}{2}\sin(\theta) \) then the needle crossed the line and I increment my hit counter. In my presentation I try to justify this mathematical reasoning using pretty pictures.
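If you don't have the slides handy, here is a condensed sketch of where the formula comes from. It is my short version, not a copy of the presentation. It assumes, as the simulation does, that the angle \( \theta \) is uniform on \( (0, \frac{\pi}{2}) \), that the midpoint distance d is uniform on \( (0, \frac{L}{2}) \) and independent of the angle, and that \( a \le L \):

```latex
P(\text{cross}) = P\left(d \le \tfrac{a}{2}\sin\theta\right)
  = \frac{2}{\pi}\int_{0}^{\pi/2} \frac{(a/2)\sin\theta}{L/2}\,\mathrm{d}\theta
  = \frac{2a}{\pi L}\int_{0}^{\pi/2} \sin\theta\,\mathrm{d}\theta
  = \frac{2a}{\pi L}
```

Setting the observed fraction of crossings \( \frac{k}{n} \) equal to this probability and solving for Pi gives \( \pi \approx \frac{2na}{kL} \).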

Finally I calculate my Pi estimate and spit it out. I ran it just now with n = 100,000 and got 3.136517. Not an accurate estimate but pretty close. When I tried it with n = 1,000,000 I got 3.142337. The more needles, the better your estimate.
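The loop gets slow for a million needles. Here is a sketch of a vectorized take on the same idea, in which the rejection step is done in batches instead of one point at a time. The function name buffon_pi, the batch size and the seed are my own choices, not from the presentation.

```r
# Vectorized sketch of the needle-dropping simulation (buffon_pi is a made-up name)
buffon_pi <- function(n, a = 1, L = 1) {
  stopifnot(a <= L)            # formula assumes needle no longer than line spacing
  x <- numeric(0)
  y <- numeric(0)
  # rejection sampling in batches: keep only points inside the unit circle
  while (length(x) < n) {
    m <- 2 * (n - length(x))   # oversample, since some points get rejected
    cx <- runif(m)
    cy <- runif(m)
    keep <- cx^2 + cy^2 <= 1
    x <- c(x, cx[keep])
    y <- c(y, cy[keep])
  }
  theta <- atan(y[1:n] / x[1:n])   # random acute angles
  d <- runif(n, 0, L / 2)          # midpoint distance to nearest line
  hit <- sum(d <= (a / 2) * sin(theta))
  (2 * n * a) / (hit * L)
}

set.seed(2014)   # arbitrary seed, only for reproducibility
buffon_pi(100000)
```

Calling buffon_pi(1e6) should run in about a second on a typical machine. The estimate still wobbles from run to run, as any simulation does.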

Now is this practical? Of course not. I can Google Pi and get it to 11 decimal places. R even has a built-in Pi variable (pi). But the fact we can use probability to estimate Pi just never ceases to amaze me. Nice going Buffon!

A Combinatorial Simulation

I have started a new book called The Art of R Programming by Norman Matloff and I’m really digging it. I won’t blog about each chapter the way I did Machine Learning for Hackers, but I did come across something I thought made good blog material.

In Chapter 8 (Doing Math and Simulations in R), Matloff presents an example on combinatorial simulation. The motivation for the example is the following problem:

Three committees, of size 3, 4 and 5, are chosen from 20 people. What is the probability that persons A and B are chosen for the same committee?

Says Matloff: “This problem is not hard to solve analytically, but we may wish to check our solution using simulation…” I suppose it’s not that hard, but if you don’t think about these kinds of problems on a regular basis they can trip you up. He doesn’t give the analytic solution but does provide code for a neat simulation of it:

# 8.6.3 combinatorial simulation
sim <- function(nreps) {
  commdata <- list()
  commdata$countabsamecomm <- 0
  for (rep in 1:nreps) {
    commdata$whosleft <- 1:20
    commdata$numabchosen <- 0
    commdata <- choosecomm(commdata,5)
    if (commdata$numabchosen > 0) next
    commdata <- choosecomm(commdata,4)
    if (commdata$numabchosen > 0) next
    commdata <- choosecomm(commdata,3)
  }
  print(commdata$countabsamecomm/nreps)
}
choosecomm <- function(comdat,comsize) {
  committee <- sample(comdat$whosleft, comsize)
  comdat$numabchosen <- length(intersect(1:2,committee))
  if (comdat$numabchosen == 2)
    comdat$countabsamecomm <- comdat$countabsamecomm + 1
  comdat$whosleft <- setdiff(comdat$whosleft,committee)
  return(comdat)
}

 

That's quite a bit of code. The first block is the main function, called "sim". The second block is the other function, "choosecomm", that gets called in the first function. The whole thing makes creative use of a list and the intersect() and setdiff() functions. If you want to simulate selecting three committees of 3, 4 and 5 people 100,000 times and see the proportion of times persons 1 and 2 are on the same committee, enter sim(100000) at the command line. You should get about 0.10.

Now how do we solve this using math and probability? First off, instead of 20 people, I like to think of 20 chips, 18 of which are white and 2 of which are red. Imagine they're in a bag and we reach in, scoop out 12 and immediately and randomly divide them into three groups of 3, 4 and 5. What is the probability one of those groups has both red chips? To calculate this we need to count the possible selections that include both red chips and divide that by the number of all possible selections. First let's do it for the committee of 5. The number of possible combinations of 5 from 20 is \( \binom{20}{5} \). The number of combinations including both red chips is \( \binom{2}{2} \times \binom{18}{3} \). We can calculate this in R using the choose() function as follows:

c5 <- choose(18,3)/choose(20,5)

 

We then repeat this for the other two committees of 4 and 3, like so:

c4 <- choose(18,2)/choose(20,4)
c3 <- choose(18,1)/choose(20,3)

 

Summing this up we get c3 + c4 + c5 = 0.10. Now it may seem strange that we always set up our denominator as if we're drawing from 20. After all, in the simulation above, there's a specific order. First 5 are chosen. If the two people are not both on that committee, then we draw 4 and then 3. It seems that should be taken into account when you solve this mathematically. But you don't have to. That's why I like to imagine just scooping 12 people up and assigning instant membership to them on a random basis. While that's not how committee selection usually happens in "real life" it makes thinking about an analytic solution easier.
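Here is the same calculation as a compact sketch, with the three committees handled in one vectorized step. The shortcut on the second-to-last line comes from my own simplification of the ratio of binomial coefficients, so check it before relying on it.

```r
sizes <- c(3, 4, 5)

# same calculation as c3 + c4 + c5, done in one vectorized step
p_each <- choose(18, sizes - 2) / choose(20, sizes)
p_same <- sum(p_each)

# shortcut: choose(18, m - 2) / choose(20, m) simplifies to m(m - 1) / (20 * 19)
p_alt <- sum(sizes * (sizes - 1)) / (20 * 19)

c(p_same = p_same, p_alt = p_alt)
```

The shortcut reads naturally in terms of seats: person A lands in one of the m seats of a committee with probability m/20, and person B then takes one of the remaining m - 1 seats with probability (m - 1)/19.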

Incidentally we can simulate the scooping up and assigning of instant membership in R as well:

sim2 <- function(nreps) {
  cnt <- 0
  for (i in 1:nreps) {
    s <- sample(20,12)
    if (all(1:2 %in% s[1:3])) {cnt <- cnt + 1; next}
    if (all(1:2 %in% s[4:7])) {cnt <- cnt + 1; next}
    if (all(1:2 %in% s[8:12])) cnt <- cnt + 1
  }
  print(cnt/nreps)
}

 

I call my function "sim2", because I'm super creative like that. Feed my function the number of simulations you want to run and it does the following for each simulation:

1. samples 12 numbers from 1 through 20 and assigns to a vector called "s"
2. checks if 1 and 2 are in the first three elements of s (committee of 3)
3. checks if 1 and 2 are in the next four elements of s (committee of 4)
4. checks if 1 and 2 are in the last five elements of s (committee of 5)

Notice the "next" in the first two if statements to improve efficiency. (No need to check subsequent if statements if the current one is true.) Obviously the sample() function scoops up my 12 people. The indexing of the "s" vector provides my instant committee membership. The first 3 are committee 3. The next 4 committee of 4. The last 5 the committee of 5. The nice thing about my function is that it's way faster than the one in the book. When I timed both functions running 100,000 simulations I got the following results:

> system.time(sim(100000))
[1] 0.1016
 user system elapsed 
 7.74   0.00    7.74 
> system.time(sim2(100000))
[1] 0.10159
 user system elapsed 
 1.42   0.00    1.42

 

Much faster, by 6 seconds. However I must say that Matloff wasn't going for speed and says as much throughout the book. He's writing R code the reader can understand. Just want to mention that lest anyone think I'm trying to one-up the author.
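For what it's worth, the same scoop-and-assign idea can be written without the explicit loop. This is only a sketch, and I haven't timed it against the other two. The name sim3 is made up. It uses match() to find the seats of persons 1 and 2 in the sample and findInterval() to turn seat numbers into committee numbers.

```r
sim3 <- function(nreps) {
  same <- replicate(nreps, {
    # seat positions of persons 1 and 2 among the 12 scooped up (NA if not picked)
    pos <- match(1:2, sample(20, 12))
    # seats 1-3 -> committee 1, seats 4-7 -> committee 2, seats 8-12 -> committee 3
    !anyNA(pos) && diff(findInterval(pos, c(1, 4, 8))) == 0
  })
  mean(same)   # proportion of simulations with both on the same committee
}

set.seed(8)    # arbitrary seed for reproducibility
sim3(10000)
```

Unlike sim and sim2, it returns the proportion instead of printing it, so the result can be stored and reused.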

Simulating the Monty Hall Problem

Back in 1990, the following question was posed to PARADE magazine columnist Marilyn vos Savant (once holder of the Guinness World Record for highest IQ):

Suppose you’re on a game show, and you’re given the choice of three doors. Behind one door is a car, behind the others, goats. You pick a door, say #1, and the host, who knows what’s behind the doors, opens another door, say #3, which has a goat. He says to you, “Do you want to pick door #2?” Is it to your advantage to switch your choice of doors?

Marilyn’s response: “Yes; you should switch. The first door has a 1/3 chance of winning, but the second door has a 2/3 chance.” And thus began a firestorm of controversy. Over a thousand people with PhDs (many of them statisticians and mathematicians) wrote in to her column to tell her she was wrong. Wikipedia has a nice summary of how it went down. And Marilyn’s own site has a nice sample of some of the thousands of letters she received. In the end, though, she was right.

If you want to know why she was right, visit the links above. You won’t believe how many ways there are to analyze this seemingly simple problem in probability. What I want to do in this post is simulate it using R. If what she said was right, I should win 2/3 of the time.

So here’s what we do. First define a variable to store the number of simulations and a vector to store the results of each simulation.

# number of simulations
n <- 10000
# allocate memory to store results
x <- vector(mode = "integer", length = n)

Now we do the simulation. Picking the first door can be thought of as a single binomial trial with p = 1/3. We either pick the car (1) or we don't (0). We have a 2/3 chance of not picking the car. After we pick our door, we know another door will be opened to reveal a goat. That happens every time. That means that when we switch, we'll either switch to a goat or to a car. Hence the "if" statement: if we picked the car (1), switch to goat (0); else switch to car (1).

# simulation
for (j in 1:n) {
  # pick a door; either goat=0 (p=2/3) or car=1 (p=1/3)
  x[j] <- rbinom(1, 1, prob = 1/3)
  # switch; if 1, switch to 0, and vice versa
  if (x[j] == 1) {
    x[j] <- 0
  } else x[j] <- 1
}

When the simulation is done, calculate the proportion of times you switched to a car (1):

sum(x)/n # proportion of times you get the car

When I ran this I got 0.6667, which is in excellent agreement with the theoretical answer of 2/3. Yay winning cars on game shows!
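The simulation above takes a shortcut: it never actually models the doors or the host. For the skeptics, here is a sketch that plays the game out door by door. The function name play_once and the seed are mine, and I'm assuming the host picks at random when he has two goat doors to choose from.

```r
# play one game, always switching; returns TRUE if we end up with the car
play_once <- function() {
  doors <- 1:3
  car   <- sample(doors, 1)                 # where the car is hidden
  pick  <- sample(doors, 1)                 # our first choice
  goats <- setdiff(doors, c(car, pick))     # doors the host is allowed to open
  # index with sample.int() so a single remaining door is handled correctly
  opened   <- goats[sample.int(length(goats), 1)]
  switched <- setdiff(doors, c(pick, opened))  # the one door left to switch to
  switched == car
}

set.seed(1990)   # arbitrary seed for reproducibility
mean(replicate(10000, play_once()))
```

The sample.int() indexing is there because sample(x, 1) treats a single number x as the range 1:x, which would let the host open the wrong door.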

Understanding Z-scores

The first time I took a statistics class I was mystified with z-scores. I understood how to find a z-score and look up the corresponding probability in a Standard Normal Distribution table. But I had no idea why I did it. I was just following orders. In this post I hope to explain the z-score and why it’s useful.

The first thing to understand about a z-score is that it’s simply a transformation, just like transforming inches to feet. Let’s say you and I each make our own paper airplane, throw them and measure the distance traveled. Mine flies 60 inches. Yours flies 78 inches. Yours flew 18 inches further than mine. We can transform inches to feet by multiplying by \frac{1}{12} . 18 \times \frac{1}{12} = 1.5 . Your airplane flew 1.5 feet further. A z-score is pretty much the same idea. You take an observation from a Normal distribution, measure how far it is from the mean of the distribution, and then convert that distance to a number of standard deviations. A quick example: I observe a measure of 67 inches from a Normal distribution with a mean of 69 and a standard deviation of 4. The z-score is calculated as z = \frac{67-69}{4} = -0.5. That tells me my observation is half a standard deviation away from the mean. The negative tells me it’s less than the mean, which you knew before even doing the calculation. Also, notice there are no units. A z-score doesn’t have units like inches or kilograms. It’s just the number of standard deviations.

So that’s your z-score: distance expressed as number of standard deviations. By itself it’s kind of interesting and informative. But it’s rare you stop with the z-score. Usually in statistics the z-score is the first step toward finding a probability. For example, let’s say a candy maker produces mints that have a label weight of 20.4 grams. Assume that the distribution of the weights of these mints is Normal with a mean of 21.37 and a standard deviation of 0.4. If we select a mint at random off the production line, what is the probability it weighs more than 22.07 grams? This is a classic statistics problem. I pulled this one from Probability and Statistical Inference 7th ed. by Hogg and Tanis.

What this problem boils down to is finding the area of the red region below:

The probability of picking a piece of candy that weighs more than 22.07 grams is equal to the red area under the Normal (21.37, 0.4) distribution curve above. Nowadays this is easily done with computers and calculators. In Excel, you simply enter =1-NORMDIST(22.07,21.37,0.4,TRUE) to find the probability as 0.040059. In R, you do 1-pnorm(22.07,mean=21.37,sd=0.4). However, not so long ago, finding this area was no easy task. Mathematically speaking, finding the area means solving the following integral:

\( \frac{1}{0.4\sqrt{2\pi}}\int_{22.07}^{\infty}\exp\left[-\frac{(x-21.37)^2}{2(0.4)^2}\right]dx \)

Believe me when I say this is not an easy calculus problem to solve. There is no closed form solution for it. The answer has to be approximated numerically, for example using a power series.
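Numerical integration is one such approximation that a computer can do for us. As a sketch, base R's integrate() function can evaluate the integral above directly using the Normal density function dnorm(). This is just one way to do it, and not necessarily how pnorm() works internally.

```r
# numerically integrate the Normal(21.37, 0.4) density from 22.07 to infinity
area <- integrate(dnorm, lower = 22.07, upper = Inf, mean = 21.37, sd = 0.4)
area$value
```

The result should agree with the Excel and pnorm() answers above to several decimal places.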

Clever statisticians, however, realized that every Normal distribution can be transformed to a Standard Normal distribution using z-scores. That is, all z-scores come from a Normal distribution with mean 0 and standard deviation 1. So what they did was create a table with approximate probabilities for z-scores ranging from 0 to something like 3.49. Here is an example of one such table. This allowed statisticians (and students) to easily solve problems like the candy problem above by transforming the value of interest to a z-score and then looking up the probability in a chart.

For our problem above, we would find \( z = \frac{22.07-21.37}{0.4} = 1.75 \). Now the problem is to find the probability of exceeding 1.75 in a Normal distribution with mean 0 and standard deviation 1. We would then look up 1.75 in a table and see the resulting probability. Usually the probability (or area under the curve) was calculated to the left of the z-score. For our problem we need area to the right of the z-score. If the table gave area to the left, we would simply subtract from 1. In the table I linked to, the probability for 1.75 is given as 0.9599. That’s area to the left. The area to the right is 1-0.9599 = 0.0401.
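The table lookup is easy to mimic in R. This sketch computes the z-score by hand, looks up the upper-tail area on the Standard Normal, and compares it with the answer obtained without transforming anything. The variable names are arbitrary.

```r
# z-score: distance from the mean in standard deviations
z <- (22.07 - 21.37) / 0.4

# upper-tail area on the Standard Normal (what the table lookup gives us)
p_z <- pnorm(z, lower.tail = FALSE)

# same area computed directly on the Normal(21.37, 0.4) scale
p_direct <- pnorm(22.07, mean = 21.37, sd = 0.4, lower.tail = FALSE)

c(z = z, via_z = p_z, direct = p_direct)
```

The two probabilities are the same number, which is the whole point of the transformation.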

For many years, that was the true value of a z-score. It allowed you to calculate areas under the normal curve by giving you a link to a table of pre-calculated values. I think most statistics classes still teach this method of finding probabilities, or at least mention it. But truth be told it’s obsolete. The math is now easily handled by computer. So while the z-score is still a useful descriptive measure of how far an observation lies from the mean of the Normal distribution from which it came, it’s no longer needed as it was in the past to find the area under the Normal curve.

So there you go, the story behind z-scores. Hopefully this post added to your understanding. If anything, I hope you appreciate what they meant to statisticians before modern computing. They truly were the only practical way to calculate probabilities based on a Normal distribution.