
Earthquake data and Benford’s Law

Much has been written about Benford's Law, that weird phenomenon where if you have a naturally occurring set of numerical data, about 30% of the numbers will begin with 1, 18% will begin with 2, 12% will begin with 3, and so on. You might expect the distribution of leading digits to be uniform, but no, that just isn't the case. About 60% of the time the leading digit is 1, 2, or 3. Though it may seem like some curious mathematical anomaly, Benford's Law has been used to detect fraud in elections and accounting.

In this post let's use R to verify that earthquake data obtained from the USGS does indeed follow Benford's Law. On May 28, 2016, I downloaded data for all earthquakes in the past 30 days. The data has a number of fields, including earthquake magnitude, depth of the earthquake, location, time and many others. Benford's Law says those numbers should follow a particular distribution.

The formula for Benford's Law is as follows:

\[P(d) = \log_{10} \left(1+\frac{1}{d}\right) \]

That says the probability that the digit d appears as the leading digit is equal to the log base 10 of 1 + 1/d. We can quickly generate these probabilities for d = 1 through 9 in R:

log10(1 + (1/(1:9)))
## [1] 0.30103000 0.17609126 0.12493874 0.09691001 0.07918125 0.06694679
## [7] 0.05799195 0.05115252 0.04575749

And we can make a quick plot as follows:

barplot(log10(1 + (1/(1:9))), names.arg = 1:9, main = "Benford's Law")

[Plot: bar plot of the Benford's Law first-digit probabilities]

So according to this law, if we look at the distribution of first digits in our earthquake data, we should see them closely follow this distribution. Let's find out!

First we need to import the data. Thanks to the USGS, this data comes ready for analysis. All we need to do is read it into R:

dat <- read.csv("all_month.csv")
nrow(dat)
## [1] 8584

Over 8500 earthquakes in the past 30 days! A quick look at the magnitude shows us most of them are very small:

summary(dat$mag)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -1.900   0.790   1.280   1.526   1.930   7.200      20

This also tells us we have some negative numbers (?) as well as some missing data, NA's. We also have numbers that start with 0 and have a decimal, such as 0.79. (In this case, Benford would say the number “starts” with a 7, not a 0.) This means when we determine the leading digits, we'll need to ignore negative signs, leading 0s, decimals and missing values.

Let's investigate the mag column. First we remove the NA's. Below, is.na(dat$mag) generates a logical vector of TRUE and FALSE values. Adding an ! in front reverses the TRUE and FALSE values. Finally, inserting the result into subsetting brackets returns only those values that are TRUE, i.e., not missing.

digits <- dat$mag[!is.na(dat$mag)]

Next we extract the first digits. To help with this I use two R packages, magrittr and stringr. The magrittr package allows us to “chain” commands together with the %>% operator. The stringr package provides the str_extract function that allows us to extract phrases that follow a certain pattern. So what we have below can be translated as follows:

  • take the absolute value of digits (to get rid of negative signs)
  • convert digits to character (so we can use the next two functions)
  • extract anything that is not a “0” or a “.” (we express this as a regular expression: "[^0\\.]")
  • extract the first digit (starting and ending at position 1)
library(magrittr)
library(stringr)
digits <- abs(digits) %>% 
  as.character() %>% 
  str_extract(pattern = "[^0\\.]") %>% 
  substr(1,1)

As an extra precaution, we then convert digits to a factor and set the levels to be all digits 1 through 9. This ensures all digits are represented in subsequent calculations.

digits <- factor(digits, levels = 1:9) # ensure all digits represented

And finally we tally the first digits and calculate proportions:

table(digits) %>% prop.table()
## digits
##          1          2          3          4          5          6 
## 0.42800141 0.17134139 0.05738763 0.09447248 0.05809177 0.03990142 
##          7          8          9 
## 0.04459570 0.05304542 0.05316277

We see 1 appearing a lot more often, but it's hard to tell how the other digits compare to Benford's Law. Let's put both distributions on the same graph. To do this, I went ahead and created a function that will allow us to check any vector of numbers against Benford's Law. Let's load the function and see how it works, then we'll break it down and explain it.

library(ggplot2)
compareBenford <- function(d){
  digits <- d[!is.na(d)]
  digits <- substr(stringr::str_extract(as.character(abs(digits)), pattern = "[^0\\.]"),1,1)
  digits <- factor(digits, levels = 1:9) # ensure all digits represented
  depth <- prop.table(table(digits))
  ben <- log10(1 + (1/(1:9)))
  dat2 <- data.frame(ben, depth)
  names(dat2) <- c("Benford","Digit",deparse(substitute(d)))
  dat2L <- reshape2::melt(dat2,id.vars="Digit", variable.name = "Type", value.name = "Frequency")
  ggplot(dat2L, aes(x=Digit, y=Frequency, fill=Type)) + 
    geom_bar(stat = "identity", position = "dodge")
}

compareBenford(dat$mag)

[Plot: first-digit proportions of dat$mag compared to Benford's Law]

We see dat$mag has more 1's than we might expect and fewer 3's, but otherwise seems to follow the distribution pretty closely. Let's check out earthquake depth.

compareBenford(dat$depth)

[Plot: first-digit proportions of dat$depth compared to Benford's Law]

This appears to fit even better.

About the function:

  • the first four lines are what we did before. Notice I went ahead and nested the functions instead of using magrittr. That's how I originally coded it before deciding to write a blog post. Then I decided to break out magrittr for the fun of it.
  • After that I calculate Benford's proportions.
  • Then I put both sets of proportions in a data frame.
  • Then I change the names of the data frame. Notice there are three names to change, not two. Why? The depth vector is actually a table. When it gets pulled into a data frame, two columns are produced: the table cells and the table names. The names are the digits 1 – 9. Notice I also use deparse(substitute(d)) to name the first-digit proportions in the data frame. This ensures that whatever vector I give my function, the column will be named after the expression I passed in. So if I give it dat$mag, the column name of the first-digit proportions will be dat$mag. (A minimal illustration of deparse(substitute()) appears right after this list.)
  • Next I reshape the data using the melt function in the reshape2 package. This puts all the proportions in one column called Frequency and the type of proportion in another column called Type. I still have my Digit column that indicates which proportion goes with which digit. Having my data in this form allows me to use ggplot to map fill color to Type.
  • Finally I create the graph. I set Digit on the x-axis, Frequency on the y-axis, and map Type to the fill color. The geom_bar function says draw a bar plot. The stat = "identity" and position = "dodge" arguments say set the height of the bars to the actual values and “dodge” them so they're next to one another.
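
If the deparse(substitute(d)) bit seems mysterious, here's a minimal illustration (the function f below is just a throwaway example):

f <- function(d) deparse(substitute(d))
f(dat$mag)
# returns the expression you passed in, as a string: "dat$mag"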

Let's look at ALL numeric columns in the earthquakes data. We need to identify all numeric columns and pull them into a vector for our function to work. Here's how I do it:

theCols <- sapply(dat, class) %in% c("numeric","integer")
numCols <- names(dat)[theCols]
allDigits <- unlist(dat[,numCols])

compareBenford(allDigits)

[Plot: first-digit proportions of all numeric columns compared to Benford's Law]

Not a bad fit. Benford's Law seems to work just fine for the earthquake data.

But what about data where Benford's Law doesn't work? Recall that Benford's Law applies to most naturally occurring sets of numbers. Not all. Take the iris data that come with R. These are measurements of 50 flowers from each of 3 species of iris. This is a small data set looking at very specific measurements. It doesn't seem to follow Benford's Law very well.

irisDigits <- unlist(iris[,1:4])
compareBenford(irisDigits)

[Plot: first-digit proportions of the iris measurements compared to Benford's Law]

A Probability Problem in Heredity – Part 2

In my last post I solved a problem from chapter 2 of M.G. Bulmer’s Principles of Statistics. In this post I work through a problem in chapter 11 that is basically a continuation of the chapter 2 problem. If you take a look at the previous post, you’ll notice we were asked to find probabilities in terms of theta. I did it and that’s nice and all, but we can go further. We actually have data, so we can estimate theta. And that’s what the problem in chapter 11 asks us to do. If you’re wondering why it took 9 chapters to get from expressing probabilities in terms of theta to estimating theta, that’s because the first problem involved basic probability and this one requires maximum likelihood. It’s a bit of a jump where statistical background is concerned.

The results of the last post were as follows:

                 Purple-flowered                  Red-flowered
Long pollen      \( \frac{1}{4}(\theta + 2) \)    \( \frac{1}{4}(1 - \theta) \)
Round pollen     \( \frac{1}{4}(1 - \theta) \)    \( \frac{1}{4}\theta \)

The table provides the probabilities of the four possible phenotypes when hybrid sweet peas are allowed to self-fertilize. For example, the probability of a self-fertilizing sweet pea producing a purple flower with long pollen is \( \frac{1}{4}(\theta + 2) \). In this post we’ll estimate theta from our data. Recall that \( \theta = (1 - \pi)^{2} \), where \( \pi \) is the probability of the dominant and recessive genes of a characteristic switching chromosomes.

Here’s the data:

                 Purple-flowered   Red-flowered
Long pollen      1528              117
Round pollen     106               381

We see from the table there are 4 mutually exclusive possibilities when the sweet pea self-fertilizes. If we think of each possibility as having its own probability of occurrence, then we can think of this data as a sample from a multinomial distribution. Since chapter 11 covers maximum likelihood estimation, the problem asks us to use the multinomial likelihood function to estimate theta.

Now the maximum likelihood estimator for each probability is \( \hat{p_{i}} = \frac{x_{i}}{n} \). But we can’t use that. That’s estimating four parameters. We need to estimate one parameter, theta. So we need to go back to the multinomial maximum likelihood function and define \( p_{i}\) in terms of theta. And of course we’ll work with the log likelihood since it’s easier to work with sums than products.

\( \log L(\theta) = y_{1} \log p_{1} + y_{2} \log p_{2} + y_{3} \log p_{3} + y_{4} \log p_{4} \)

If you’re not sure how I got that, google “log likelihood multinomial distribution” for more PDF lecture notes than you can ever hope to read.

Now let’s define the probabilities in terms of one parameter, theta:

\( \log L(\theta) = y_{1} \log f_{1}(\theta) + y_{2} \log f_{2}(\theta) + y_{3} \log f_{3}(\theta) + y_{4} \log f_{4}(\theta) \)

Now take the derivative. Once we have that, we can set it equal to 0 and solve for theta. The solution will be the value of theta at which the log likelihood attains its maximum:

\( \frac{d \log L(\theta)}{d\theta} = \frac{y_{1}}{f_{1}(\theta)} f'_{1}(\theta) + \frac{y_{2}}{f_{2}(\theta)} f'_{2}(\theta) + \frac{y_{3}}{f_{3}(\theta)} f'_{3}(\theta) + \frac{y_{4}}{f_{4}(\theta)} f'_{4}(\theta)\)

Time to go from the abstract to the applied with our values. The y’s are the data from our table and the functions of theta are the results from the previous problem.

\( \frac{d \log L(\theta)}{d\theta} = \frac{1528}{1/4(2 + \theta)} \frac{1}{4} - \frac{117}{1/4(1 - \theta)}\frac{1}{4} - \frac{106}{1/4(1 - \theta)}\frac{1}{4} + \frac{381}{1/4(\theta)} \frac{1}{4} \)
\( \frac{d \log L(\theta)}{d\theta} = \frac{1528}{2 + \theta} - \frac{117}{1 - \theta} - \frac{106}{1 - \theta} + \frac{381}{\theta} \)
\( \frac{d \log L(\theta)}{d\theta} = \frac{1528}{2 + \theta} - \frac{223}{1 - \theta} + \frac{381}{\theta} \)

Set equal to 0 and solve for theta. Beware, it gets messy.

\( \frac{[1528(1 - \theta)\theta] - [223(2 + \theta)\theta] + [381(2 + \theta)(1 - \theta)]}{(2 + \theta)(1 - \theta)\theta} = 0\)

Yeesh. Fortunately we can ignore the denominator: the fraction equals 0 only when the numerator equals 0. Expanding the numerator, we’re left with this:

\( 1528\theta - 1528\theta^{2} - 446\theta - 223\theta^{2} + 762 - 381\theta - 381\theta^{2} = 0\)

And that reduces to the following quadratic equation:

\( -2132\theta^{2} + 701\theta + 762 = 0\)

I propose we use an online calculator to solve this equation. Here’s a good one. Just plug in the coefficients and hit solve to find the roots. Our coefficients are a = -2132, b = 701, and c = 762. Since it’s a quadratic equation we get two answers:

\( x_{1} = -0.4556 \)
\( x_{2} = 0.7844 \)

Theta is a probability, which must be between 0 and 1, so we toss the negative root and behold our maximum likelihood estimate of theta: 0.7844.
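
If you’d rather stay in R than use an online calculator, here’s a quick sketch of two ways to check this. Note that polyroot() takes the coefficients in increasing order of degree, and loglik is just a helper name I made up:

# roots of 762 + 701*theta - 2132*theta^2
polyroot(c(762, 701, -2132))
# roots should be approximately -0.4556 and 0.7844

# or maximize the log likelihood directly
loglik <- function(theta) {
  1528 * log((2 + theta)/4) + 117 * log((1 - theta)/4) +
    106 * log((1 - theta)/4) + 381 * log(theta/4)
}
optimize(loglik, interval = c(0.01, 0.99), maximum = TRUE)$maximum
# should also be approximately 0.7844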

Remember that \( \theta = (1 - \pi)^{2}\). If we solve for pi, we get \( \pi = 1 - \theta^{1/2}\), which works out to be 0.1143. That is, we estimate the probability of characteristic genes switching chromosomes to be about 11%. Therefore we can think of theta as the probability of having two parents that experienced no gene switching.

Now point estimates are just estimates. We would like to know how good the estimate is. That’s where confidence intervals come into play. Let’s calculate one for theta.

It turns out that we can estimate the variability of our maximum likelihood estimator as follows:

\( V(\theta) = \frac{1}{I(\theta)}\), where \( I(\theta) = -E(\frac{d^{2}\log L}{d \theta^{2}}) \)

We need to determine the second derivative of our log likelihood equation, take the expected value by plugging in our maximum likelihood estimator, multiply that by -1, and then take the reciprocal. The second derivative works out to be:

\( \frac{d^{2}\log L}{d \theta^{2}} = -\frac{1528}{(2 + \theta)^{2}} -\frac{223}{(1 - \theta)^{2}} -\frac{381}{\theta^{2}} \)

The negative expected value of the second derivative is obtained by plugging in our estimate of 0.7844 and multiplying by -1. Let’s head to the R console to calculate this:

th <- 0.7844 # our ML estimate
(I <- -1 * (-1528/((2+th)^2) - 223/((1-th)^2) - 381/(th^2))) # information
[1] 5613.731

Now take the reciprocal and we have our variance:

(v.th <- 1/I)
[1] 0.0001781347

We can take the square root of the variance to get the standard error and multiply by 1.96 to get the margin of error for our estimate. Then add/subtract the margin of error to our estimate for a confidence interval:

# confidence limits on theta
0.784+(1.96*sqrt(v.th)) # upper bound
[1] 0.8101596
0.784-(1.96*sqrt(v.th)) # lower bound
[1] 0.7578404

Finally we convert the confidence interval for theta to a confidence interval for pi:

# probability of switching over
th.ub <- 0.784+(1.96*sqrt(v.th))
th.lb <- 0.784-(1.96*sqrt(v.th))
1-sqrt(th.ub) # lower bound
[1] 0.09991136
1-sqrt(th.lb) # upper bound
[1] 0.1294597

The probability of genes switching chromosomes is most probably in the range of 10% to 13%.

Deriving the Poisson Distribution

The Poisson distribution can be derived from the binomial distribution in two steps:

  1. Substitute \frac{\mu}{n} for p
  2. Let n increase without bound

Step one is possible because the mean of a binomial distribution is \mu = np. So another way of expressing p, the probability of success on a single trial, is \frac{\mu}{n}. This has some intuitive appeal. Recall that a binomial distribution gives the probability of a number of successes (x) in a fixed number of trials (n) for some probability (p). So if an event has a probability of 0.2 and we observe 10 independent trials, the expected number of occurrences is 10(0.2) = 2. On average we would see the event happen twice in 10 trials. Which means we can state the probability as a rate, \frac{2}{10}: success occurs, on average, 2 times per 10 trials.

The next step leads to the Poisson. We let the number of trials increase without bound, which means the probability shrinks. Since n is in the denominator of \frac{\mu}{n} , this gives a small chance of success in a huge number of trials. So whereas the binomial deals with a fixed number of trials, the Poisson accounts for an unlimited number of trials in time or space. Taking the limit as n tends to infinity gives us the Poisson. Here’s how.

Recall the binomial probability mass function:
f(x) = \frac{n!}{x!(n-x)!}p^{x}(1-p)^{n-x}

Substitute \frac{\mu}{n} for p:
f(x) = \frac{n!}{x!(n-x)!}(\frac{\mu}{n})^{x}(1-\frac{\mu}{n})^{n-x}

Now the fun begins. Let n grow very large:
\lim_{n \to \infty}\frac{n!}{x!(n-x)!}(\frac{\mu}{n})^{x}(1-\frac{\mu}{n})^{n-x}

But wait! Don’t let n blow up just yet. Let’s do some rearranging first to help us take the limit:
\lim_{n \to \infty}\frac{n!}{(n-x)!n^{x}}\frac{\mu^{x}}{x!}(1-\frac{\mu}{n})^{n}(1-\frac{\mu}{n})^{-x}

And let’s also simplify \frac{n!}{(n-x)!n^{x}}:

\frac{n!}{(n-x)!n^{x}} = \frac{n(n-1)(n-2)\dots(n-x+1)}{n^{x}} =1(1-\frac{1}{n})\dots(1-\frac{x-2}{n})(1-\frac{x-1}{n})

Before we flip the switch on the limit-to-infinity machine, let’s assess what we got:
\lim_{n \to \infty}1(1-\frac{1}{n})\dots(1-\frac{x-2}{n})(1-\frac{x-1}{n})\frac{\mu^{x}}{x!}(1-\frac{\mu}{n})^{n}(1-\frac{\mu}{n})^{-x}

Notice this is basically four things multiplied together:

  1. 1(1-\frac{1}{n})\dots(1-\frac{x-2}{n})(1-\frac{x-1}{n})
  2. \frac{\mu^{x}}{x!}
  3. (1-\frac{\mu}{n})^{n}
  4. (1-\frac{\mu}{n})^{-x}

We can make life easier by taking the limit of each of those four pieces individually:

  1. \lim_{n \to \infty}1(1-\frac{1}{n})\dots(1-\frac{x-2}{n})(1-\frac{x-1}{n})=1
  2. \lim_{n \to \infty}\frac{\mu^{x}}{x!}=\frac{\mu^{x}}{x!} (no n to blow up!)
  3. \lim_{n \to \infty}(1-\frac{\mu}{n})^{n}=e^{-\mu}
  4. \lim_{n \to \infty}(1-\frac{\mu}{n})^{-x}=1

1, 2, and 4 are easy. Number 3, though, requires a long forgotten formula you probably learned in calculus just long enough to take an exam:

\lim_{n \to \infty}(1+\frac{b}{n})^{n}=e^{b}

Set b = -\mu and you get the result.
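
If that formula seems too good to trust from memory, a quick numerical check in R (using \mu = 5 as an example) shows the convergence:

mu <- 5
n <- c(10, 100, 1000, 100000)
cbind(n, approx = (1 - mu/n)^n, limit = exp(-mu))
# the approx column should close in on exp(-5), about 0.0067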

Now put it all back together and you have the probability mass function of the Poisson distribution:

f(x) = \frac{\mu^{x}e^{-\mu}}{x!}

Often you see \mu as \lambda in this formula.

This new formula is approximately the binomial probability mass function for a large number of trials and small probability. That’s basically what we did, right? Substitute an expression for p involving n and let n grow large. That means the probability got small while n got large. Which further means that before we had modern computers, the Poisson was very handy for approximating binomial probabilities since the binomial coefficient can be difficult to compute for large n.

For example, say we have 100 independent trials with probability of success 0.05. What is the probability of observing 7 successes?

Using the binomial:
f(7) = \frac{100!}{7!(100-7)!}0.05^{7}(1-0.05)^{100-7} = 0.106026

Using the Poisson as an approximation (where \mu = np = 100(0.05) = 5):
f(7) = \frac{5^{7}e^{-5}}{7!} = 0.104445

Very close indeed!

By the way, how did I calculate those probabilities? I used Excel functions:

Binomial: =BINOMDIST(7,100,0.05,0)
Poisson: =POISSON(7,5,0)

The numbers in the parentheses should be self-explanatory. The final 0 means “false, not cumulative”. If you set it to 1, you get the probability of 7 or fewer successes in 100 trials. You can also use a TI-83 Plus. Just press 2nd – DISTR and choose binompdf or poissonpdf. If you want cumulative probabilities, select binomcdf or poissoncdf. The parameters are in a different order from Excel:

binomcdf/binompdf(n,p,x)
poissoncdf/poissonpdf(mu,x)

Of course “real statisticians” would use R:

> dbinom(7,100,0.05)
[1] 0.1060255

> dpois(7,5)
[1] 0.1044449

I would be careless not to mention that the Poisson distribution is not the same as the binomial. The binomial describes the distribution of events (or “successes”) over a fixed number of trials for a fixed probability of success on each trial. The Poisson describes the distribution of events in space or time for a given rate of occurrence. The binomial is specified by n (number of trials) and p (probability of success on each trial). The Poisson is specified by \mu, the rate at which events occur per unit of space or time.

And finally, and no doubt most important, the way to pronounce Poisson is \pwa-sawn\. Say it correctly and you earn instant credibility at cocktail parties.

How many rolls of a die to see all sides at least once?

Say you have a fair die. (Most of them are these days.) You roll it once and observe a 3. You roll it again and observe a 1. You roll it yet again and get a 4. In three throws you’ve observed 3 of the possible sides. What if you keep going? How many rolls would it take to see all six sides at least once? Obviously the minimum number is 6. But the probability of that happening is about 0.015 (1 \times \frac{5}{6}\times \frac{4}{6}\times \frac{3}{6}\times \frac{2}{6}\times \frac{1}{6}). What’s a typical number of rolls? Or, put in the language of a statistics textbook, what’s the expected number of rolls?
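
If you want R to do that arithmetic for you:

prod((6:1)/6)
# probability of seeing all six sides in the first six rolls, about 0.015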

To solve this problem we need to use the geometric distribution. If a series of trials are independent of one another, and each trial has the same probability of success, then the number of trials needed to observe the first success has a geometric distribution. This is the scenario of our roll of the die. Each roll is independent of the others, and the probability of success remains the same for each roll, whatever we define “success” to be. The mean of a geometric distribution is \frac{1}{p}. That means the expected number of times we need to roll a die to observe, say, a four is 6.
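
A quick simulation sketch backs up that mean of 6. One wrinkle: R’s rgeom() counts the failures before the first success, so we add 1 to get the number of rolls:

set.seed(1)
mean(rgeom(100000, prob = 1/6) + 1)
# should be approximately 6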

Back to our problem. We’re certain to get at least one of the faces on the first roll. So that probability is 1. On the second roll we have a probability of \frac{5}{6} of getting a different face than what we got on the first roll. It’s not certain. There is a probability of \frac{1}{6} of getting the same result as the first roll. Once we have observed two sides, the probability of observing a third side becomes \frac{4}{6}. And so on. Notice that as we observe previously unseen sides the definition of success changes, and therefore the probability changes. So to determine the expected number of rolls to see all sides of a die at least once, we need to find the expected number of rolls for each definition of “success” and sum them. The steps below describe the process. (The numbers represent the order of the observed sides, not the values on the die.)

  1. We’re guaranteed to see one of the sides on the first roll. Expected number of rolls: 1
  2. We have \frac{5}{6} probability of seeing a different side than what was previously observed. Expected number of rolls: 1/\frac{5}{6}=1.2
  3. We have \frac{4}{6} probability of seeing a different side than what was previously observed in steps 1 and 2. Expected number of rolls: 1/\frac{4}{6}=1.5
  4. We have \frac{3}{6} probability of seeing a different side than what was previously observed in steps 1 -3. Expected number of rolls: 1/\frac{3}{6}=2
  5. We have \frac{2}{6} probability of seeing a different side than what was previously observed in steps 1-4. Expected number of rolls: 1/\frac{2}{6}=3
  6. We have \frac{1}{6} probability of seeing a different side than what was previously observed in steps 1-5. Expected number of rolls: 1/\frac{1}{6}=6

Adding the expected number of rolls for each definition of success, we get 14.7. So we expect to roll a die about 15 times, on average, before we observe all sides at least once.
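
And here’s a small simulation sketch to back up that 14.7 figure (rolls_to_see_all is just a throwaway helper I made up):

# roll a fair die until all six sides have appeared; return the number of rolls
rolls_to_see_all <- function() {
  seen <- logical(6)
  rolls <- 0
  while (!all(seen)) {
    seen[sample(6, 1)] <- TRUE
    rolls <- rolls + 1
  }
  rolls
}
set.seed(1)
mean(replicate(10000, rolls_to_see_all()))
# should be approximately 14.7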

We can use this same line of thinking to answer such “real-life” questions as how many boxes of cereal would you have to buy to collect the latest series of cheap plastic toys, assuming the toys are randomly added to the cereal in a uniform manner. Of course if you’re desperate to collect the precious toys and have a modicum of patience, the answer is probably 0. You’d just wait to buy the whole set on eBay.