Author Archives: Clay Ford

Using a Bootstrap to Estimate Power and Significance Level

I’ve been reading Common Errors in Statistics (and How to Avoid Them) by Phillip Good and James Hardin. It’s a good bathroom/bedtime book. You can pick it up and put it down as you please. Each chapter is self-contained and contains bite-size, easy-to-read sections. I’m really happy with it so far.

Anyway, chapter 3 had a section on computing Power and sample size that inspired me to hop on the computer:

If the data do not come from one of the preceding distributions, then we might use a bootstrap to estimate the power and significance level.

In preliminary trials of a new device, the following test results were observed: 7.0 in 11 out of 12 cases and 3.3 in 1 out of 12 cases. Industry guidelines specified that any population with a mean test result greater than 5 would be acceptable. A worst-case or boundary-value scenario would include one in which the test result was 7.0 3/7th of the time, 3.3 3/7th of the time, and 4.1 1/7th of the time. i.e.,  (7 \times \frac{3}{7}) + (3.3 \times \frac{3}{7}) + (4.1 \times \frac{1}{7}) = 5

The statistical procedure required us to reject if the sample mean of the test results were less than 6. To determine the probability of this event for various sample sizes, we took repeated samples with replacement from the two sets of test results.

If you want to try your hand at duplicating these results, simply take the test values in the proportions observed, stick them in a hat, draw out bootstrap samples with replacement several hundred times, compute the sample means, and record the results.

Well of course I want to try my hand at duplicating the results. Who wouldn’t?

The idea here is to bootstrap from two samples: (1) the one they drew in the preliminary trial with mean = 6.69, and (2) the hypothetical worst-case boundary example with mean = 5. We bootstrap from each and calculate the proportion of samples with mean less than 6. The proportion of bootstrap means less than 6 from the first population (where the true mean = 6.69) serves as a proxy for the Type I error, or significance level. That's the proportion of times we make the wrong decision: we conclude the mean is less than 6 when in fact it's really 6.69. The proportion of bootstrap means less than 6 from the second population (where the true mean = 5) serves as a proxy for power. That's the proportion of times we make the correct decision: we conclude the mean is less than 6 when in fact it's really 5.

In the book they show the following table of results:
[Table 3.2 from the book: estimated significance level and power for sample sizes 3, 4, and 5]

We see they have computed the significance level (Type I error) and power for three different sample sizes. Here’s me doing the same thing in R:

# starting sample of test results (mean = 6.69)
el1 <- c(7.0,3.3)
prob1 <- c(11/12, 1/12)

# hypothetical worst-case population (mean = 5)
el2 <- c(7, 3.3, 4.1)
prob2 <- c(3/7, 3/7, 1/7)

n <- 1000
for (j in 3:5){ # loop through sample sizes
  m1 <- c()
  m2 <- c()
    for (i in 1:n) {
      m1[i] <- mean(sample(el1,j, replace=TRUE,prob=prob1)) # test results
      m2[i] <- mean(sample(el2,j, replace=TRUE,prob=prob2)) # worst-case
    }	
print(paste("Type I error for sample size =",j,"is",sum(m1 < 6.0)/n)) 
print(paste("Power for sample size =",j,"is",sum(m2 < 6.0)/n))
}

To begin I define vectors containing the values and their probabilities of occurrence. Next I set n = 1000 because I want to do 1000 bootstrap samples. Then I start the first of two for loops. The first is for my sample sizes (3 through 5) and the next is for my bootstrap samples. Each time I begin a new sample size loop I create two empty vectors, m1 and m2, to store the means from each bootstrap sample. As I loop through my 1000 bootstrap samples, I take the mean of each sample and assign it to the ith element of m1 and m2. m1 holds the sample means from the test results and m2 holds the sample means from the worst-case boundary scenario.

Finally I print the results using the paste function. Notice how I calculate the proportion. I create a logical vector with m1 < 6.0 (and likewise for m2). This returns a vector of TRUE/FALSE values, which sum() treats as 1s and 0s. Summing the vector gives the number of times the mean was less than 6, and dividing by n (1000) gives the proportion. Here are my results:

[1] "Type I error for sample size = 3 is 0.244"
[1] "Power for sample size = 3 is 0.845"
[1] "Type I error for sample size = 4 is 0.04"
[1] "Power for sample size = 4 is 0.793"
[1] "Type I error for sample size = 5 is 0.067"
[1] "Power for sample size = 5 is 0.886"

Pretty much the same thing! I guess I could have used the boot function in the boot package to do this. That’s probably more efficient. But this was a clear and easy way to duplicate their results.
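For what it's worth, here's a more compact way to run the same simulation for a single sample size using replicate() and the fact that taking the mean() of a logical vector gives a proportion. This is just a sketch that reuses the el1, el2, prob1, and prob2 objects defined above.

# same simulation for one sample size (j = 4), using replicate()
j <- 4
m1 <- replicate(1000, mean(sample(el1, j, replace = TRUE, prob = prob1)))
m2 <- replicate(1000, mean(sample(el2, j, replace = TRUE, prob = prob2)))
mean(m1 < 6) # estimated Type I error (proportion of bootstrap means below 6)
mean(m2 < 6) # estimated power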

Editing a lot of variable names in an R data frame

Someone I work with asked about how to easily update lots of variable names in an R data frame after importing a CSV file. Apparently the column headers in the CSV file are long and unwieldy, and simply opening the CSV file beforehand and editing the variable names is not desirable. So what to do? Well, you're going to have to actually type the new variable names at least once. There's no getting around that. But it would be nice to type them only once and never again. This would be useful if you need to import the same CSV file over and over as it gets updated over time. Maybe it's a quarterly report or weekly data dump of some sort. In that case we can do something like the following:

# read in data set and save variable names into a vector
ds <- read.csv("dataset.csv",header=TRUE)
old_names <- names(ds)

# write variable names to text file
writeLines(old_names, "names.txt")

We imported the CSV file, saved the variable names to a vector, and then wrote that vector to a text file with each variable on its own line. Now we can open the text file in an editor and edit the variable names. I find this easier than doing it in the context of R code. No need for quotes, functions, assignment operators, etc. When you’re done, save and close your text file. Now you’re ready to change your variable names:

names(ds) <- readLines("names.txt")

Done! The next time you need to import the same CSV file, you can just run these two lines:

ds <- read.csv("dataset.csv",header=TRUE)
names(ds) <- readLines("names.txt")
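If you want a little insurance against the text file getting out of sync with the CSV (say a new column gets added later), you could add a quick check before assigning the names. This is just a suggested tweak, not something required:

ds <- read.csv("dataset.csv", header=TRUE)
new_names <- readLines("names.txt")
stopifnot(length(new_names) == ncol(ds)) # fail loudly if the counts don't match
names(ds) <- new_names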

Simulation to Represent Uncertainty in Regression Coefficients

Working through Gelman and Hill’s book can be maddening. The exposition is wonderful but recreating their examples leads to new and exciting levels of frustration. Take the height and earnings example in chapter 4. It’s a simple linear regression of earnings on height. You’d think you could download the data from the book web site, use the code in the book, and reproduce the example. They sure lead you to think that. But it doesn’t work that way. For starters, when you load the data it has 2029 records. However the output of the regression in the book shows n = 1192. So subsetting needs to be done. As far as I can tell, the subsetting is not discussed in the book.

Now the author has an “earnings” folder on his site under an “examples” directory, which contains an R file called “earnings_setup.R”. (Never mind that the book itself doesn’t appear to mention this directory on the web site.) So this seems to be the place where we find out how to subset the data. The key line of code is

ok <- !is.na (earn+height+sex+age) & earn>0 & yearbn>25

which creates a logical vector to subset the data. But when you run it and request the dimensions of the data frame you have 1059 records, not 1192! After trial and error I believe the subset to reproduce the results in the book should be

ok <- !is.na (earn+height+sex) & earn>0

That gave me 1192. For the record, here’s my full code:

library(foreign) # for read.dta()
heights <- read.dta ("http://www.stat.columbia.edu/~gelman/arm/examples/earnings/heights.dta")
attach(heights)
male <- 2 - sex # make male 1 (and therefore female = 0)
ok <- !is.na (earn+height+sex) & earn>0
heights.clean <- as.data.frame (cbind (earn, height, sex, male)[ok,])
heights.clean$log.earn <- log(heights.clean$earn)

OK, so now(!) we can reproduce their example:

earn.logmodel.3 <- lm (log.earn ~ height + male + height:male, data=heights.clean)

The reason I was interested in this example was for another example in chapter 7 on "using simulation to represent uncertainty in regression coefficients" (p. 142). In other words, instead of using the standard errors and intervals obtained from the predict() function in R, we compute uncertainties by simulation. It seems easy enough using the sim() function from the book's arm package. You do something like the following:

library(arm)
sim.1 <- sim(earn.logmodel.3, 1000)

The sim function takes two arguments: your model and the number of simulations. It returns a vector of simulated residual standard deviations and a matrix of simulated regression coefficients. We can use these to calculate confidence intervals for coefficients that do not have standard errors in the regression output. Their example asks "what can be said about the coefficient of height among men?" If you look up at the model, you can see it contains an interaction term of height and male. To answer the question we can't just look at the standard error of the height coefficient or just the interaction coefficient. There is no simple way to do what they ask using regression output. So the book instructs us to use the output of the sim function to answer this as follows:

height.for.men.coef <- sim.1$beta[,2] + sim.1$beta[,4]
quantile(height.for.men.coef, c(0.025,0.975))

Except that doesn't work. More frustration. It produces an error that says "Error in sim.1$beta : $ operator not defined for this S4 class" (instead of giving us a 95% interval). With some googling and persistence I was able to determine that the following is how it should be done:

height.for.men.coef <- sim.1@coef[,2] + sim.1@coef[,4]
quantile(height.for.men.coef, c(0.025,0.975))
         2.5%         97.5% 
-0.0004378039  0.0507464098 

Notice that "@coef" replaces "$beta". And with that I was able to finally reproduce the example I was interested in!

Now about this simulation function. While I appreciate functions that save time and make life easy, I do like to know how they work. Fortunately Gelman and Hill provide pseudo-code in the book. It goes like this:

  1. Run your regression to compute the vector \hat{\beta} of estimated parameters, the unscaled estimation covariance matrix V_{\beta} , and the residual variance \hat{\sigma^{2}}
  2. Create n random simulations for the coefficient vector \beta and residual standard deviation. For each simulation draw:
    1. Simulate \sigma = \hat{\sigma}\sqrt{(n - k)/X} where X is a random draw from the \chi^{2} distribution with n - k degrees of freedom.
    2. Given the random draw of \sigma , simulate \beta from a multivariate normal distribution with mean \hat{\beta} and variance matrix \sigma^{2}V_{\beta}

Not too bad. Let's use this to manually run our own simulations, so we have an idea of how the sim() function works. (Plus you may not want to use the arm package as it requires loading 9 more packages.)

Step 1 is easy enough. That's just running your regression as you normally would. Next we need to extract our estimated parameters, the unscaled covariance matrix and the residual standard deviation. We also need to snag degrees of freedom for our chi-square random draw. Here's how we can get them:

# extract coefficients
earn.logmodel.3$coef

# extract residual standard error from model
summary(earn.logmodel.3)$sigma

# extract unscaled covariance matrix
summary(earn.logmodel.3)$cov.unscaled

# extract k and n - k; first two elements in vector
summary(earn.logmodel.3)$df

Let's use this information to do a single simulation.

s.hat <- summary(earn.logmodel.3)$sigma
n.minus.k <- summary(earn.logmodel.3)$df[2]
library(MASS) # need for mvrnorm function to simulate draws from multivariate normal dist'n
# simulate residual standard deviation
(s.hat.sim <- s.hat*sqrt(n.minus.k/rchisq(1, n.minus.k)))
[1] 0.8591814
# use simulated residual standard deviation to simulate regression coefficients
mvrnorm(1,earn.logmodel.3$coef, s.hat.sim^2*summary(earn.logmodel.3)$cov.unscaled)
(Intercept)       height         male  height:male 
7.906605029  0.025163124 -0.160828921  0.007904422 

That seems to work. How about we try doing 1000 simulations. Here's one way:

n <- 1000
sim.2.sigma <- rep(NA,n)
sim.2.coef <- matrix(NA,n,4)
for (i in 1:n){
 sim.2.sigma[i] <- s.hat*sqrt(n.minus.k/rchisq(1, n.minus.k))
 sim.2.coef[i,] <- mvrnorm(1,earn.logmodel.3$coef,sim.2.sigma[i]^2*summary(earn.logmodel.3)$cov.unscaled)
}

Now let's see how our simulation results compare to what we got using the sim() function:

height.for.men.coef.2 <- sim.2.coef[,2] + sim.2.coef[,4]
quantile(height.for.men.coef.2, c(0.025,0.975))
        2.5%        97.5% 
-0.001828216  0.049499381 

Looks similar to what we got above. Nice. It's probably better to just use the sim() function for this sort of thing, but at least now we know a little more about what it's doing.
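If the explicit loop bothers you, the chi-square draws can be vectorized and the coefficient draws wrapped in sapply(). This is just a sketch built on the same objects created above (s.hat, n.minus.k, n, and the fitted model); the sim.3.* names are mine:

# draw all n residual standard deviations at once
sim.3.sigma <- s.hat*sqrt(n.minus.k/rchisq(n, n.minus.k))
# then draw one coefficient vector for each simulated sigma
sim.3.coef <- t(sapply(sim.3.sigma, function(s)
  mvrnorm(1, earn.logmodel.3$coef, s^2*summary(earn.logmodel.3)$cov.unscaled)))
quantile(sim.3.coef[,2] + sim.3.coef[,4], c(0.025,0.975))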

Scraping Virginia Tech Football Data, Part 2

In an earlier post I described how I went about scraping football data off the Virginia Tech athletics web site. Basically I wrote a function that scraped data one season at a time. It worked, but when I decided to return to this project I realized what a pain it would be to do all seasons from 1987 to 2012. I would have to find the ID number for the first game and last game, then run the function for every season. And then do bowl games because the ID number for those games does not follow the pattern of regular season games. It wouldn’t have been unreasonable to just call the function for every season. Typing it all up probably wouldn’t take more than a few minutes. But it’s inefficient so I decided to modify the program to do all seasons at once. Here’s what I came up with.

First I needed to get all the ID numbers. Here’s my code:

allids <- data.frame(season=numeric(0),ID=character(0)) #will hold all box score ids; 322 games from 1987 - 2012
for (i in 1987:2012){
  url <- paste("http://www.hokiesports.com/football/stats/",i,sep="")
  wp <- readLines(url)  
  box <- grep("showstats\\.html\\?[0-9]+",wp,value=TRUE) # lines containing box score links
  ids <- sub(".*?html\\?([0-9]{4,5}).*", "\\1", box) # extract the 4- or 5-digit game ID
  ids <- data.frame(season=i,ID=ids)
  allids <- rbind(allids,ids)
}

So I go to each season's page, read the web page source code, find each line that contains "showstats.html?xxxx" (where xxxx = ID number), and pull the ID number into a vector. That last part requires a fancy-pants regular expression which took me a while to figure out. It basically says "match the whole line against the pattern in the double quotes, and replace it with just the sub-expression captured in the parentheses". This is known as "tagging". The part in the parentheses is the tag: ([0-9]{4,5}). It's referenced in the sub() replacement with \\1. For more information, see page 99 of Phil Spector's book, Data Manipulation with R. Anyway, I then create a two-column data frame called "allids" that contains the season and ID numbers:

> head(allids)
  season   ID
1   1987 5803
2   1987 5804
3   1987 5805
4   1987 5806
5   1987 5807
6   1987 5808
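As a quick check of that tagging regex, here's what it does to a single made-up line of the sort grep() returns (the real lines have more HTML around the link, but the idea is the same):

test <- "<a href=\"showstats.html?5803\">Box Score</a>"
sub(".*?html\\?([0-9]{4,5}).*", "\\1", test) # returns "5803"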

Now I can use the ID column in my for loop to retrieve drive summaries for every season, like so.

for (i in allids[,2]){
	url <- paste("http://www.hokiesports.com/football/stats/showstats.html?",i,sep="")
	web_page <- readLines(url)
	# ... (rest of the drive summary scraping code continues here)

To see the final (for now) version of the code and the CSV file of the data, see my GitHub page for this project. You'll notice I had to build in an error check as it turned out that not all games had drive summaries:

if (length(grep("Virginia Tech Drive Summary", web_page)) == 0) next

That says if grep doesn't find a line with "Virginia Tech Drive Summary" then go to the next iteration in the loop. For some reason the 1994 season only has two games with drive summaries.

So now I have a lot of data and I guess that means I should analyze it. I suppose that will be a future blog post.

Playing with R Markdown

I was playing with R Markdown in RStudio and thought I’d share my results. I can’t believe how easy this is to use! In RStudio, just go File…New…R Markdown. This opens a new template with some helpful code ready to use. This is your Rmd file. It’s basically a script file, except not only can you submit R code from this file, you can also save the output and graphs into one fancy HTML page. This is so much easier than copying-and-pasting R code into a WordPress post and saving/uploading/linking to images created in R. For example, see this post I wrote a while back. It took a long time to make.

Anyway, when you’re done, you click the Knit HTML button and your Rmd file is “knitted” into an HTML file. There’s a little bit of extra markup you need to add to ensure proper formatting, but it’s super easy. Check out this page to see how it works. You just surround chunks of your R code with some simple markup.

So here’s what I did. First I worked through the Introductory session of Modern Applied Statistics with S (MASS) by Venables and Ripley. Here is my Rmd file and final output. Next I worked problem 4 from chapter 3 of Data Analysis Using Regression and Multilevel/Hierarchical Models by Gelman and Hill. Here’s the Rmd file and final output. The final output links are the cool parts. Those are just HTML files with embedded images. I uploaded the Rmd files so you could see the marked-up R code. As you’ll see, there’s not much there. If you want your R code and associated output and graphs to be nicely formatted in HTML, just surround it with

```{r}
...your R code...
```

That’s it. You can also add simple Markdown formatting, such as making text bold with double asterisks or creating headers with hash marks.

A Logistic Regression Checklist

I recently read The Checklist Manifesto by Atul Gawande and was fascinated by how relatively simple checklists can improve performance and results in such complex endeavors as surgery or flying a commercial airplane. I decided I wanted to make a checklist of my own for logistic regression. It ended up not being a checklist on how to do it per se, but rather a list of important facts to remember. Here’s what I came up with.

  • Logistic regression models the probability that y = 1: P(y_{i} = 1) = logit^{-1}(X_{i}\beta), where logit^{-1}(x) = \frac{e^{x}}{1+e^{x}}
  • Logistic regression predictions are probabilistic. The model predicts a probability that y = 1; it does not make a point prediction.
  • The function logit^{-1}(x) = \frac{e^{x}}{1+e^{x}} transforms continuous values to the range (0,1).
  • Dividing a regression coefficient by 4 gives an upper bound on the predictive difference corresponding to a unit difference in x. For example, if \beta = 0.33, then 0.33/4 \approx 0.08. This means a unit increase in x corresponds to no more than an 8% positive difference in the probability that y = 1.
  • The odds of success (i.e., y = 1) increase multiplicatively by e^{\beta} for every one-unit increase in x. That is, exponentiated logistic regression coefficients can be interpreted as odds ratios. For example, let’s say we have a regression coefficient of 0.497. Exponentiating gives e^{0.497} = 1.64 . That means the odds of success increase by 64% for each one-unit increase in x. Recall that odds = \frac{p}{1-p} . If our predicted probability at x is 0.674, then the odds of success are \frac{0.674}{0.326} = 2.07 . Therefore at x + 1, the odds will increase by 64% from 2.07 to 2.07(1.64) = 3.40. Notice that 1.64 = \frac{3.40}{2.07}, which is an odds ratio. The ratio of the odds at x + 1 to the odds at x will always be e^{\beta}, where \beta is a logistic regression coefficient. (See the short R sketch at the end of this post for an illustration of this and the divide-by-4 rule.)
  • Plots of raw residuals from logistic regression are generally not useful. Instead it’s preferable to plot binned residuals “by dividing the data into categories (bins) based on their fitted values, and then plotting the average residual versus the average fitted value for each bin.” (Gelman & Hill, p. 97). Example R code for doing this can be found here.
  • The error rate is the proportion of cases for which the model makes the wrong prediction: we predict y = 1 when the predicted probability exceeds 0.5 and y = 0 otherwise, and count how often that prediction disagrees with the observed outcome. It’s not good if your error rate equals the null rate. The null rate is the error rate you would get by simply guessing the more frequent outcome for every case. Let’s say your data has 58% of y = 1 and 42% of y = 0; then guessing y = 1 for every case gives a null rate of 42%. Further, let’s say you do some logistic regression on this data and your model has an error rate of 36%. That is, 36% of the time it predicts the wrong outcome. This means your model does only 6 percentage points better than simply guessing that all cases are y = 1.
  • Deviance is a measure of error. Lower deviance is better. When an informative predictor is added to a model, we expect deviance to decrease by more than 1. If not, then we’ve likely added a non-informative predictor to the model that just adds noise.
  • If a predictor x is completely aligned with the outcome so that y = 1 when x is above some threshold and y = 0 when x is below some threshold, then the coefficient estimate will explode to some gigantic value. This means the parameter cannot be estimated. This is an identifiability problem called separation.

Most of this comes from Chapter 5 of Data Analysis Using Regression and Multilevel/Hierarchical Models by Gelman and Hill. I also pulled a little from chapter 5 of An Introduction to Categorical Data Analysis by Agresti.
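Here’s a small R sketch, using simulated data, that illustrates the divide-by-4 rule and the odds-ratio interpretation from the list above (the data and coefficients here are made up for illustration, not from either book):

set.seed(1)
x <- rnorm(500)
p <- 1/(1 + exp(-(-0.5 + 0.8*x)))   # inverse-logit of a made-up linear predictor
y <- rbinom(500, 1, p)              # simulated 0/1 outcomes
fit <- glm(y ~ x, family = binomial)
b <- coef(fit)["x"]
b/4     # rough upper bound on the change in P(y = 1) per unit change in x
exp(b)  # odds ratio: multiplicative change in the odds per unit change in x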

The standard deviation of the sampling distribution of the mean

Someone on Reddit posted Chapter 1 of Howard Wainer’s book, Picturing the Uncertain World. The name of the chapter is The Most Dangerous Equation and its subject is the formula for the standard deviation of the sampling distribution of the mean. This little guy:

\sigma_{\bar{x}}=\frac{\sigma}{\sqrt{n}}

Why is it dangerous? The danger comes from not understanding it. Wainer gives several examples of how ignorance of this equation has “led to billions of dollars of loss over centuries, yielding untold hardship.” But before diving into the examples he briefly describes the formula. It’s not a proof but rather a heuristic demonstration of how it works:

…if we measure, for example, the heights of, say, 1000 students at a particular high school, we might find that the average height is 67 inches, but heights might range from perhaps as little as 55 inches to as much as 80 inches. A number that characterizes this variation is the standard deviation….But now suppose we randomly grouped the 1000 children into 10 groups of 100 and calculated the average within each group. The variation of these 10 averages would likely be much smaller than [the standard deviation of all 1000 students]…

This got me to thinking about how I could demonstrate this in R. It seems like we could generate a population, then take repeated samples of size n to create a sampling distribution, and then show that the standard deviation of the sampling distribution is indeed equal to the population standard deviation divided by the square root of n. Let’s do it.

First I load the gtools package which has a function called permutations(). As you might guess, it takes a vector of values and generates all permutations of a given size. After that I generate a “population” of 10 values using the rnorm() function. So my 10 values come from a normal distribution with mean 10 and standard deviation 2, but I’m treating these 10 values as if they’re my entire population for this toy example.

library(gtools)
# generate population of size 10
pop <- rnorm(10,10,2)

Now we're ready to generate the sampling distribution. I decided to let n = 3. This means we need to generate every possible sample of size 3. This is where the permutations() function comes in. The first argument tells it how big the source vector is, the second states the size of the target vectors, and the third is the source vector. The last tells it to allow repeats. This is important: it replicates sampling with replacement, which is necessary to demonstrate this formula using a finite population.

# generate the sampling distribution
# first generate all samples of size 3
sdist <- permutations(10,3,pop,repeats=TRUE)

If you look at the variable sdist, you'll see it's a 1000 x 3 matrix. That's 1000 permutations of size 3 from our original population. Next we take the mean of each row (or sample of size 3):

# find the mean of all samples (in rows)
sdm <- apply(sdist,1,mean)

The variable "sdm" is our sampling distribution of the mean. We took every possible sample of size 3 from our population and calculated the mean of each sample. Now the first thing to note is that the mean of the sampling distribution is equal to the mean of our population:

mean(pop) == mean(sdm)
[1] TRUE

But what I really wanted to verify was the formula \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}, or equivalently \sigma_{\bar{x}}^{2} = \frac{\sigma^{2}}{n}:

sum(sdm^2)/1000 - (mean(sdm)^2) == (sum(pop^2)/10 - (mean(pop)^2))/3
[1] TRUE

Aha! It's true! Of course this isn't a proof, but it came out like we expected. Notice I had to manually find the population variance in each case using \sigma^{2} = E(X^{2}) - \mu^{2}. That's because the var() and sd() functions in R divide by n-1.
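For completeness, here's the same check in the standard-deviation form of the formula, with a small helper function (my own, not from the book) for the population standard deviation, i.e., dividing by the number of values rather than n - 1. This reuses the pop and sdm objects created above:

popsd <- function(z) sqrt(mean(z^2) - mean(z)^2) # population sd, dividing by length(z)
popsd(sdm)          # sd of the sampling distribution of the mean
popsd(pop)/sqrt(3)  # population sd divided by sqrt(n), with n = 3
all.equal(popsd(sdm), popsd(pop)/sqrt(3)) # should be TRUE up to floating point tolerance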

Using Simulation to Compute Confidence Intervals

I’ve been working through Gelman and Hill’s book, Data Analysis Using Regression and Multilevel/Hierarchical Models. I originally wanted to read it and blog about each chapter the way I did Machine Learning for Hackers. But that book had 12 chapters. This one has 25. And the chapters are longer and denser. I don’t think I have the appetite for such an endeavor. However, I’m still working through the book and want to at least blog about certain topics that catch my eye. Today’s post comes from Chapter 2, Concepts and Methods from Basic Probability and Statistics.

In reviewing confidence intervals, the authors mention using simulation to form confidence intervals for statistics that have complicated standard error formulas. It’s one thing to compute a confidence interval for a mean. The standard error is \frac{s}{\sqrt{n}} . But what about a ratio? Well, that’s more difficult. Thus the book gives a nice example on how to use simulation to find a confidence interval for a ratio (p. 20). Instead of reproducing it verbatim, I thought I would apply the strategy to finding a confidence interval for a median.

My old stats textbook, Probability and Statistical Inference by Hogg and Tanis (7th edition) has a problem in section 6.10 (#7) that gives you measurements of one of the front legs of 20 different spiders and asks you to find a 95% confidence interval for the median. Now this chapter presents a method for calculating this, which is covered in depth at this very nice Penn State web site. Since the problem is odd-numbered, I can flip to the back and see the answer is (15.40, 17.05). Let’s try finding a 95% confidence interval for this data using simulation.

First off, here is the data:

x <- scan("Exercise_6_10-07.txt",skip=1)
> sort(x)
 [1] 13.55 13.60 14.05 15.10 15.25 15.40 15.45 15.75 16.25 16.40 16.45 16.65
[13] 16.80 16.95 17.05 17.55 17.75 19.05 19.05 20.00

The measurements are in millimeters. The median of this data is 16.425, easily calculated in R with the median() function. Now what about a confidence interval? This is after all a sample of 20 spiders. Our median is but a point estimate of the population median, a value we will never be able to directly measure. Using R we can simulate a large number of samples by re-sampling from the data and taking the median each time, like this:

nsims <- 1000 # number of simulations
m <- rep(NA,nsims) # empty vector to store medians
for (i in 1:nsims){  
	m[i] <- median(sample(x,replace=TRUE))
	}

When we're done, we can use the quantile() function as follows to find the 2.5 and 97.5 percentiles and thus estimate a confidence interval:

quantile(m,c(0.025,0.975))
    2.5%    97.5% 
15.42500 17.00125 

That's very close to the given answer of (15.40, 17.05). This example would technically be filed under "bootstrap", but I think it captures the spirit of using simulation to find a confidence interval.
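Since the boot package comes up so often for this kind of thing, here's roughly what the same interval would look like using boot() and boot.ci() with percentile intervals. Treat this as a sketch; I haven't tuned any of the options:

library(boot)
med <- function(data, indices) median(data[indices]) # statistic in the form boot() expects
b <- boot(x, med, R = 1000)
boot.ci(b, type = "perc") # percentile interval, comparable to the quantile() approach above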

Buffon’s Needle Problem, or How to use Probability to Estimate Pi

I gave a presentation on Buffon’s Needle Problem in a job interview once. Here’s the presentation I gave in PDF format if you’re interested. If you’ve never heard of Buffon’s Needle Problem, you should open my little presentation and browse through it. It’s one of the damndest things I’ve ever learned.

Here’s the setup. Imagine you have a floor with parallel lines, kind of like hardwood floors, with all lines equally spaced from one another. Let’s say the distance between the lines is L:
[Figure: parallel lines spaced a distance L apart]

Now imagine you drop some needles of length L on the floor and count the instances of where the needles cross the lines:
[Figure: needles of length L dropped on the floor, with several crossing the lines]

Yes, I know, I have red needles! Cool, right? Anyway, here’s the crazy thing: if you drop a lot of needles (say 10,000) and count the number of needles crossing lines, you can use that information to estimate Pi! It turns out that if the distance between lines and the needle length are both 1, then \pi \approx \frac{2n}{k} , where n = number of needles and k = number of needles crossing lines. I don’t know about you but I think that’s nuts!

Let’s fire up R and see this in action. Here’s slightly modified code from the presentation:

a <- 1 # length of needle
L <- 1 # distance between lines
n <- 100000 # number of dropped needles
hit <- 0
for(i in 1:n) {
	x <- runif(1,0,1)
	y <- runif(1,0,1)
	 while(x^2 + y^2 > 1) { # no points outside of unit circle
	  	x <- runif(1,0,1)
	 	y <- runif(1,0,1)
	}	
	theta <- atan(y/x) # the random angle
	d <- runif(1,0,(L/2)) # distance of needle midpoint to nearest line
	if(d <= (a/2)*sin(theta)) {
		hit <- hit + 1
	} 
}
pi.est <- (n*2)/(hit)
pi.est

First I define the distance between the lines (L) and the needle length (a) to both be 1. They don't have to be equal, but the needle length does need to be less than or equal to the distance between the lines. It turns out that in general \pi \approx \frac{2na}{kL}. In my example, I have a = L = 1, so it simplifies to \pi \approx \frac{2n}{k}. Next I define a variable called "hit" to count the number of needles crossing a line and then I dive into a loop to simulate dropping 100,000 needles.

The first 7 lines in the loop generate a random acute angle (less than 90 degrees or \frac{\pi}{2} radians) by way of the arctan function and x and y points that lie within the unit circle. The reason the points need to lie within the unit circle is to ensure all angles have an equal chance of being selected. The next line randomly generates a distance (d) from the midpoint of the needle to the nearest line. Using my random angle and random distance I then check to see if my needle crossed the line. If d \le \frac{a}{2}sin(\theta) then the needle crossed the line and I increment my hit counter. In my presentation I try to justify this mathematical reasoning using pretty pictures.

Finally I calculate my Pi estimate and spit it out. I ran it just now with n = 100,000 and got 3.136517. Not an accurate estimate but pretty close. When I tried it with n = 1,000,000 I got 3.142337. The more needles, the better your estimate.
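If you'd rather avoid the explicit loop, here's a vectorized sketch of the same idea (still assuming a = L = 1). Note that instead of redrawing points that fall outside the unit circle, this version simply discards them, so the effective number of needles is the number of points kept:

n <- 100000
x <- runif(n); y <- runif(n)
keep <- x^2 + y^2 <= 1            # keep only points inside the unit circle
theta <- atan(y[keep]/x[keep])    # random acute angles
d <- runif(length(theta), 0, 1/2) # distance from needle midpoint to nearest line
hits <- sum(d <= (1/2)*sin(theta))
2*length(theta)/hits              # estimate of pi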

Now is this practical? Of course not. I can Google Pi and get it to 11 decimal places. R even has a built-in Pi variable (pi). But the fact we can use probability to estimate Pi just never ceases to amaze me. Nice going Buffon!

Scraping Data off a Web Site

I’m taking the Data Analysis class through Coursera and one of the topics we’ve covered so far is how to “scrape” data off a web site. The idea is to programmatically go through the source code of a web page, pull out some data, and then clean it up so you can analyze it. This may seem like overkill at first glance. After all, why not just select the data with your mouse and copy-and-paste into a spreadsheet? Well, for one, there may be dozens (or hundreds) of pages to visit and copying-and-pasting from each one would be time-consuming and impractical. Second, rarely does a copy-and-paste off a web site produce data ready for analysis. You have to tidy it up, sometimes quite a bit. Clearly these are both tasks we would like to automate.

To put this idea to use, I decided to scrape some data from the box scores of Virginia Tech football games. I attended Tech and love watching their football team, so this seemed like a fun exercise. Here’s an example of one of their box scores. You’ll see it has everything but what songs the band played during halftime. I decided to start simple and just scrape the Virginia Tech Drive Summaries. This summarizes each drive, including things like number of plays, number of yards gained, and time of possession. Here’s the function I wrote in R, called vtFballData:

library(XML) # readHTMLTable() below comes from the XML package

vtFballData <- function(start,stop,season){
	dsf <- c()
	# read the source code
	for (i in start:stop){
	url <- paste("http://www.hokiesports.com/football/stats/showstats.html?",i,sep="")
	web_page <- readLines(url)

	# find where VT drive summary begins
	dsum <- web_page[(grep("Virginia Tech Drive Summary", web_page) - 2):
                         (grep("Virginia Tech Drive Summary", web_page) + 18)]
	dsum2 <- readHTMLTable(dsum)
	rn <- dim(dsum2[[1]])[1]
	cn <- dim(dsum2[[1]])[2]
	ds <- dsum2[[1]][4:rn,c(1,(cn-2):cn)]
	ds[,3] <- as.character(ds[,3]) # convert from factor to character
	py <- do.call(rbind,strsplit(sub("-"," ",ds[,3])," "))
	ds2 <- cbind(ds,py)
	ds2[,5] <- as.character(ds2[,5]) # convert from factor to character
	ds2[,6] <- as.character(ds2[,6]) # convert from factor to character
	ds2[,5] <- as.numeric(ds2[,5]) # convert from character to numeric
	ds2[,6] <- as.numeric(ds2[,6]) # convert from character to numeric
	ds2[,3] <- NULL # drop original pl-yds column

	names(ds2) <-c("quarter","result","top","plays","yards")
	# drop unused factor levels carried over from readlines
	ds2$quarter <- ds2$quarter[, drop=TRUE] 
	ds2$result <- ds2$result[, drop=TRUE]

	# convert TOP from factor to character
	ds2[,3] <- as.character(ds2[,3]) 
	# convert TOP from M:S to just seconds
	ds2$top <- sapply(strsplit(ds2$top,":"),
  		function(x) {
    		x <- as.numeric(x)
    		x[1]*60 + x[2]})

	# need to add opponent
	opp <- web_page[grep("Drive Summary", web_page)]
	opp <- opp[grep("Virginia Tech", opp, invert=TRUE)] # not VT
	opp <- strsplit(opp,">")[[1]][2]
	opp <- sub(" Drive Summary

I'm sure this is three times longer than it needs to be and could be written much more efficiently, but it works and I understand it. Let's break it down.

My function takes three values: start, stop, and season. Start and stop are both numerical values needed to specify a range of URLs on hokiesports.com. Season is simply the year of the season. I could have scraped that as well but decided to enter it by hand since this function is intended to retrieve all drive summaries for a given season.

The first thing I do in the function is define an empty variable called "dsf" ("drive summaries final") that will ultimately be what my function returns. Next I start a for loop that will start and end at numbers I feed the function via the "start" and "stop" parameters. For example, the box score of the 1st game of the 2012 season has a URL ending in 14871. The box score of the last regular season game ends in 14882. To hit every box score of the 2012 season, I need to cycle through this range of numbers. Each time through the loop I "paste" the number to the end of "http://www.hokiesports.com/football/stats/showstats.html?" and create my URL. I then feed this URL to the readLines() function which retrieves the code of the web page and I save it as "web_page".

Let's say we're in the first iteration of our loop and we're doing the 2012 season. We just retrieved the code of the box score web page for the Georgia Tech game. If you go to that page, right click on it and view source, you'll see exactly what we have stored in our "web_page" object. You'll notice it has a lot of stuff we don't need. So the next part of my function zeros in on the Virginia Tech drive summary:

# find where VT drive summary begins
dsum <- web_page[(grep("Virginia Tech Drive Summary", web_page) - 2):
                 (grep("Virginia Tech Drive Summary", web_page) + 18)]

This took some trial and error to assemble. The grep() function tells me which line contains the phrase "Virginia Tech Drive Summary". I subtract 2 from that line to get the line number where the HTML table for the VT drive summary begins (i.e., where the opening <table> tag appears). I need this for the upcoming function. I also add 18 to that line number to get the final line of the table code. I then use this range of line numbers to extract the drive summary table and store it as "dsum". Now I feed "dsum" to the readHTMLTable() function, which converts an HTML table to a dataframe (in a list object), and I save the result as "dsum2". The readHTMLTable() function is part of the XML package, so you have to download and install that package first and call library(XML) before running this function.

At this point we have a pretty good-looking table. But it has 4 extra rows at the top we need to get rid of. Plus I don't want every column. I only want the first column (quarter) and last three columns (How lost, Pl-Yds, and TOP). This is a personal choice. I suppose I could have snagged every column, but decided to just get a few. To get what I want, I define two new variables, "rn" and "cn". They stand for row number and column number, respectively. "dsum2" is a list object with the table in the first element, [[1]]. I reference that in the call to the dim() function. The first element returned is the number of rows, the second the number of columns. Using "rn" and "cn" I then index dsum2 to pull out a new table called "ds". This is pretty much what I wanted. The rest of the function is mainly just formatting the data and giving names to the columns.

The next three lines of code serve to break up the "Pl-Yds" column into two separate columns: plays and yards. The following five lines change variable classes and remove the old "Pl-Yds" column. After that I assign names to the columns and drop unused factor levels. Next up I convert TOP into seconds. This allows me to do mathematical operations, such as summing and averaging.
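As a quick illustration of what that TOP conversion does, here's the same strsplit()/sapply() idea applied to a couple of made-up time-of-possession values:

top <- c("4:41", "0:58")
sapply(strsplit(top, ":"), function(x) {
  x <- as.numeric(x)
  x[1]*60 + x[2]
}) # returns 281 and 58 (seconds)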

The final chunk of code adds the opponent. This was harder than I thought it would be. I'm sure it can be done faster and easier than I did it, but what I did works. First I use the grep() function to identify the two lines that contain the phrase "Drive Summary". One will always have Virginia Tech and the other their opponent. The next line uses the invert parameter of grep to pick the line that does not contain Virginia Tech. The selected line looks like this for the first box score of 2012: "<td colspan=\"9\">Georgia Tech Drive Summary</td>". Now I need to extract "Georgia Tech". To do this I split the string by ">" and save the second element:

opp <- strsplit(opp,">")[[1]][2]

It looks like this after I do the split:

[[1]]
[1] "<td colspan=\"9\""              "Georgia Tech Drive Summary</td"

Hence the need to add the "[[1]][2]" reference. Finally I substitute " Drive Summary</td" with nothing and that leaves me with "Georgia Tech". Then I add the season and opponent to the table and update the "dsf" object. The last line is necessary to allow me to add each game summary to the bottom of the previous table of game summaries.

Here's how I used the function to scrape all VT drive summaries from the 2012 regular season:

dsData2012 <- vtFballData(14871,14882,2012)

To identify start and stop numbers I had to go to the VT 2012 stats page and hover over all the box score links to figure out the number sequence. Fortunately they go in order. (Thank you VT athletic dept!) The bowl game is out of sequence; its number is 15513. But I could get it by calling vtFballData(15513,15513,2012). After I call the function, which takes about 5 seconds to run, I get a data frame that looks like this:

 season          opp quarter result top plays yards
   2012 Georgia Tech       1   PUNT 161     6    24
   2012 Georgia Tech       1     TD 287    12    56
   2012 Georgia Tech       1  DOWNS 104     5    -6
   2012 Georgia Tech       2   PUNT 298     7    34
   2012 Georgia Tech       2   PUNT  68     4    10
   2012 Georgia Tech       2   PUNT  42     3     2

Now I'm ready to do some analysis! There are plenty of other variables I could have added, such as whether VT won the game, whether it was a home or away game, whether it was a noon, afternoon or night game, etc. But this was good enough as an exercise. Maybe in the future I'll revisit this little function and beef it up.