I have started a new book called The Art of R Programming by Norman Matloff and I’m really digging it. I won’t blog about each chapter the way I did Machine Learning for Hackers, but I did come across something I thought made good blog material.
In Chapter 8 (Doing Math and Simulations in R), Matloff presents an example on combinatorial simulation. The motivation for the example is the following problem:
Three committees, of size 3, 4 and 5, are chosen from 20 people. What is the probability that persons A and B are chosen for the same committee?
Says Matloff: “This problem is not hard to solve analytically, but we may wish to check our solution using simulation…” I suppose it’s not that hard, but if you don’t think about these kinds of problems on a regular basis they can trip you up. He doesn’t give the analytic solution but does provide code for a neat simulation of it:
# 8.6.3 combinatorial simulation
sim <- function(nreps) { commdata <- list() commdata$countabsamecomm <- 0 for (rep in 1:nreps) { commdata$whosleft <- 1:20 commdata$numabchosen <- 0 commdata <- choosecomm(commdata,5) if (commdata$numabchosen > 0) next commdata <- choosecomm(commdata,4) if (commdata$numabchosen > 0) next commdata <- choosecomm(commdata,3) } print(commdata$countabsamecomm/nreps) }
choosecomm <- function(comdat,comsize) { committee <- sample(comdat$whosleft, comsize) comdat$numabchosen <- length(intersect(1:2,committee)) if (comdat$numabchosen == 2) comdat$countabsamecomm <- comdat$countabsamecomm + 1 comdat$whosleft <- setdiff(comdat$whosleft,committee) return(comdat) }
That's quite a bit of code. The first block is the main function, called "sim". The second block is the other function, "choosecomm", that gets called in the first function. The whole thing makes creative use of a list and the intersect() and setdiff() functions. If you want to simulate selecting three committees of 3, 4 and 5 people 10,000 times and see how many time persons 1 and 2 are on the same committee, enter sim(100000) at the command line. You should get about 0.10.
Now how do we solve this using math and probability? First off, instead of 20 people, I like to think of 20 chips, 18 of which are white and 2 that are red. Imagine they're in a bag and we reach in and scoop out 12 and immediately and randomly divide into three groups of 3, 4 and 5. What is the probability one of those groups has both red chips? To calculate this we need to enumerate the possible selections that include both red chips and divide that by all possible selections . First let's do it for the committee of 5. All possible combination of 5 from 20 is . All possible combinations including both red chips are . We can calculate this in R using the choose() function as follows:
c5 <- choose(18,3)/choose(20,5)
We then repeat this for the other two committees of 4 and 3, like so:
c4 <- choose(18,2)/choose(20,4) c3 <- choose(18,1)/choose(20,3)
Summing this up we get c3 + c4 + c5 = 0.10. Now it may seem strange that we always set up our denominator as if we're drawing from 20. After all, in the simulation above, there's a specific order. First 5 are chosen. If the two people are not both on that committee, then we draw 4 and then 3. It seems that should be taken into account when you solve this mathematically. But you don't have to. That's why I like to imagine just scooping 12 people up and assigning instant membership to them on a random basis. While that's not how committee selection usually happens in "real life" it makes thinking about an analytic solution easier.
Incidentally we can simulate the scooping up and assigning of instant membership in R as well:
sim2 <- function(nreps) { cnt <- 0 for (i in 1:nreps) { s <- sample(20,12) if (all(1:2 %in% s[1:3])) {cnt <- cnt + 1; next} if (all(1:2 %in% s[4:7])) {cnt <- cnt + 1; next} if (all(1:2 %in% s[8:12])) cnt <- cnt + 1 } print(cnt/nreps) }
I call my function "sim2", because I'm super creative like that. Feed my function the number of simulations you want to run and it does the following for each simulation:
1. samples 12 numbers from 1 through 20 and assigns to a vector called "s"
2. checks if 1 and 2 are in the first three elements of s (committee of 3)
3. checks if 1 and 2 are in the next four elements of s (committee of 4)
4. checks if 1 and 2 are in the last five elements of s (committee of 5)
Notice the "next" in the first two if statements to improve efficiency. (No need to check subsequent if statements if the current one is true.) Obviously the sample() function scoops up my 12 people. The indexing of the "s" vector provides my instant committee membership. The first 3 are committee 3. The next 4 committee of 4. The last 5 the committee of 5. The nice thing about my function is that it's way faster than the one in the book. When I timed both functions running 100,000 simulations I got the following results:
> system.time(sim(100000)) [1] 0.1016 user system elapsed 7.74 0.00 7.74 > system.time(sim2(100000)) [1] 0.10159 user system elapsed 1.42 0.00 1.42
Much faster, by 6 seconds. However I must say that Matloff wasn't going for speed and says as much throughout the book. He's writing R code the reader can understand. Just want to mention that lest anyone think I'm trying to one-up the author.