# Buffon’s Needle Problem, or How to use Probability to Estimate Pi

I gave a presentation on Buffon’s Needle Problem in a job interview once. Here’s the presentation I gave in PDF format if you’re interested. If you’ve never heard of Buffon’s Needle Problem, you should open my little presentation and browse through it. It’s one of the damndest things I’ve ever learned.

Here’s the setup. Imagine you have a floor with parallel lines, kind of like a hardwood floor, with all lines equally spaced from one another. Let’s say the distance between the lines is L. Now imagine you drop some needles of length L on the floor and count the instances where the needles cross the lines. (Yes, I know, I have red needles! Cool, right?) Anyway, here’s the crazy thing: if you drop a lot of needles (say 10,000) and count the number of needles crossing lines, you can use that information to estimate Pi! It turns out that if the distance between lines and the needle length are both 1, then $\pi \approx \frac{2n}{k}$, where n = number of needles dropped and k = number of needles crossing lines. I don’t know about you, but I think that’s nuts!

Let’s fire up R and see this in action. Here’s slightly modified code from the presentation:

```r
a <- 1      # length of needle
L <- 1      # distance between lines
n <- 100000 # number of dropped needles
hit <- 0
for (i in 1:n) {
  x <- runif(1, 0, 1)
  y <- runif(1, 0, 1)
  while (x^2 + y^2 > 1) { # no points outside of unit circle
    x <- runif(1, 0, 1)
    y <- runif(1, 0, 1)
  }
  theta <- acos(x / sqrt(x^2 + y^2)) # the random angle
  theta <- atan(y/x)      # the random angle
  d <- runif(1, 0, (L/2)) # distance of needle midpoint to nearest line
  if (d <= (a/2) * sin(theta)) {
    hit <- hit + 1
  }
}
pi.est <- (2*n)/hit
pi.est
```


First I define the distance between the lines (L) and the needle length (a) to both be 1. They don't have to be equal, but the needle length does need to be less than or equal to the distance between the lines. It turns out that in general $\pi \approx \frac{2na}{kL}$. In my example, I have $a = L = 1$, so it simplifies to $\pi \approx \frac{2n}{k}$. Next I define a variable called "hit" to count the number of needles crossing a line and then I dive into a loop to simulate dropping 100,000 needles.

The first 7 lines in the loop generate a random acute angle (less than 90 degrees, or $\frac{\pi}{2}$ radians) by way of the arctan function and x and y points that lie within the unit circle. The reason the points need to lie within the unit circle is to ensure all angles have an equal chance of being selected. The next line randomly generates a distance (d) from the midpoint of the needle to the nearest line. Using my random angle and random distance I then check to see if my needle crossed the line: if $d \le \frac{a}{2}\sin(\theta)$, then the needle crossed the line and I increment my hit counter. In my presentation I try to justify this mathematical reasoning using pretty pictures.
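
For reference, here is the standard derivation behind that crossing condition and the estimate (my own summary, not taken from the presentation):

```latex
% A needle of length a has a vertical half-reach of (a/2)sin(theta),
% so it crosses the nearest line exactly when d <= (a/2)sin(theta).
% With d ~ Uniform(0, L/2) and theta ~ Uniform(0, pi/2), the crossing
% probability is the integral of (a/2)sin(theta) over the (d, theta)
% rectangle, divided by the rectangle's area:
P(\text{cross}) \;=\; \frac{1}{\tfrac{L}{2}\cdot\tfrac{\pi}{2}}
  \int_0^{\pi/2} \frac{a}{2}\,\sin\theta \,\mathrm{d}\theta
  \;=\; \frac{a/2}{\pi L/4} \;=\; \frac{2a}{\pi L}
% Setting k/n equal to this probability and solving for pi gives
% pi ~ 2na/(kL), which reduces to 2n/k when a = L = 1.
```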

Finally I calculate my Pi estimate and spit it out. I ran it just now with n = 100,000 and got 3.136517. Not a dead-on estimate, but pretty close. When I tried it with n = 1,000,000 I got 3.142337. The more needles you drop, the better your estimate.
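
As an aside, the same simulation can be written without the loop by generating all the random values at once. This is a vectorized sketch of my own (the function name is made up), which also handles the general case where needle length and line spacing differ:

```r
# Vectorized Buffon simulation (my own sketch, not from the presentation).
# Assumes needle length a <= line spacing L.
buffon_est <- function(n, a = 1, L = 1) {
  theta <- runif(n, 0, pi/2) # random acute angle, sampled directly
  d <- runif(n, 0, L/2)      # distance of needle midpoint to nearest line
  k <- sum(d <= (a/2) * sin(theta)) # number of needles crossing a line
  (2 * n * a) / (k * L)      # pi estimate from the general formula
}

set.seed(1)
buffon_est(1e6)
```

Sampling theta directly from a uniform distribution on $(0, \frac{\pi}{2})$ is equivalent to the accept-reject-then-arctan approach above, just faster.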

Now is this practical? Of course not. I can Google Pi and get it to 11 decimal places. R even has a built-in Pi variable (pi). But the fact we can use probability to estimate Pi just never ceases to amaze me. Nice going Buffon!

# Scraping Data off a Web Site

I’m taking the Data Analysis class through Coursera and one of the topics we’ve covered so far is how to “scrape” data off a web site. The idea is to programmatically go through the source code of a web page, pull out some data, and then clean it up so you can analyze it. This may seem like overkill at first glance. After all, why not just select the data with your mouse and copy-and-paste it into a spreadsheet? Well, for one, there may be dozens (or hundreds) of pages to visit, and copying-and-pasting from each one would be time-consuming and impractical. Second, rarely does a copy-and-paste off a web site produce data ready for analysis. You have to tidy it up, sometimes quite a bit. Clearly these are both tasks we would like to automate.

To put this idea to use, I decided to scrape some data from the box scores of Virginia Tech football games. I attended Tech and love watching their football team, so this seemed like a fun exercise. Here’s an example of one of their box scores. You’ll see it has everything but what songs the band played during halftime. I decided to start simple and just scrape the Virginia Tech Drive Summaries. This summarizes each drive, including things like number of plays, number of yards gained, and time of possession. Here’s the function I wrote in R, called vtFballData:

```r
# Requires the XML package for readHTMLTable(): library(XML)
vtFballData <- function(start, stop, season){
  dsf <- c()
  for (i in start:stop){
    url <- paste("http://www.hokiesports.com/football/stats/showstats.html?", i, sep = "")
    web_page <- readLines(url)

    # find where VT drive summary begins
    dsum <- web_page[(grep("Virginia Tech Drive Summary", web_page) - 2):
                     (grep("Virginia Tech Drive Summary", web_page) + 18)]
    dsum2 <- readHTMLTable(dsum)

    rn <- dim(dsum2[[1]])[1] # number of rows
    cn <- dim(dsum2[[1]])[2] # number of columns
    ds <- dsum2[[1]][4:rn, c(1, (cn-2):cn)] # drop extra header rows; keep quarter + last 3 columns
    ds[,3] <- as.character(ds[,3]) # convert Pl-Yds from factor to character
    py <- do.call(rbind, strsplit(sub("-", " ", ds[,3]), " ")) # split Pl-Yds in two
    ds2 <- cbind(ds, py)
    ds2[,5] <- as.character(ds2[,5]) # convert from factor to character
    ds2[,6] <- as.character(ds2[,6]) # convert from factor to character
    ds2[,5] <- as.numeric(ds2[,5])   # convert from character to numeric
    ds2[,6] <- as.numeric(ds2[,6])   # convert from character to numeric
    ds2[,3] <- NULL # drop original Pl-Yds column

    names(ds2) <- c("quarter", "result", "top", "plays", "yards")
    # drop unused factor levels carried over from readLines
    ds2$quarter <- ds2$quarter[, drop = TRUE]
    ds2$result <- ds2$result[, drop = TRUE]

    # convert TOP from factor to character
    ds2[,3] <- as.character(ds2[,3])
    # convert TOP from M:S to just seconds
    ds2$top <- sapply(strsplit(ds2$top, ":"),
                      function(x) {
                        x <- as.numeric(x)
                        x[1]*60 + x[2]})

    # identify the opponent
    opp <- web_page[grep("Drive Summary", web_page)]
    opp <- opp[grep("Virginia Tech", opp, invert = TRUE)] # the line that is not VT
    opp <- strsplit(opp, ">")[[1]]
    opp <- sub(" Drive Summary</td", "", opp[2])

    ds2 <- cbind(season, opp, ds2)
    dsf <- rbind(dsf, ds2) # append this game's drives to the running table
  }
  dsf
}
```
I'm sure this is three times longer than it needs to be and could be written much more efficiently, but it works and I understand it. Let's break it down.

My function takes three values: start, stop, and season. Start and stop are both numerical values needed to specify a range of URLs on hokiesports.com. Season is simply the year of the season. I could have scraped that as well, but decided to enter it by hand since this function is intended to retrieve all drive summaries for a given season.

The first thing I do in the function is define an empty variable called "dsf" ("drive summaries final") that will ultimately be what my function returns. Next I start a for loop that will start and end at numbers I feed the function via the "start" and "stop" parameters. For example, the box score of the 1st game of the 2012 season has a URL ending in 14871. The box score of the last regular season game ends in 14882. To hit every box score of the 2012 season, I need to cycle through this range of numbers. Each time through the loop I "paste" the number to the end of "http://www.hokiesports.com/football/stats/showstats.html?" and create my URL. I then feed this URL to the readLines() function, which retrieves the code of the web page, and I save it as "web_page".

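To make the URL construction concrete, here's what that paste() call produces on its own for the first game of 2012:

```r
i <- 14871 # box score number for the first game of the 2012 season
url <- paste("http://www.hokiesports.com/football/stats/showstats.html?", i, sep = "")
url
# "http://www.hokiesports.com/football/stats/showstats.html?14871"
```
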
Let's say we're in the first iteration of our loop and we're doing the 2012 season. We just retrieved the code of the box score web page for the Georgia Tech game. If you go to that page, right click on it and view source, you'll see exactly what we have stored in our "web_page" object. You'll notice it has a lot of stuff we don't need. So the next part of my function zeros in on the Virginia Tech drive summary:

```r
# find where VT drive summary begins
dsum <- web_page[(grep("Virginia Tech Drive Summary", web_page) - 2):
                 (grep("Virginia Tech Drive Summary", web_page) + 18)]
```

This took some trial and error to assemble. The grep() function tells me which line contains the phrase "Virginia Tech Drive Summary". I subtract 2 from that line number to get the line where the HTML table for the VT drive summary begins (i.e., where the opening `<table>` tag appears). I also add 18 to that line number to get the final line of the table code. I then use this range of line numbers to extract the drive summary table and store it as "dsum". Now I feed "dsum" to the readHTMLTable() function, which converts an HTML table to a data frame (in a list object), and save the result as "dsum2". The readHTMLTable() function is part of the XML package, so you have to download and install that package and call library(XML) before running this function.

At this point we have a pretty good looking table, but it has some extra rows at the top we need to get rid of. Plus I don't want every column. I only want the first column (quarter) and the last three columns (How lost, Pl-Yds, and TOP). This is a personal choice. I suppose I could have snagged every column, but decided to just get a few. To get what I want, I define two new variables, "rn" and "cn". They stand for row number and column number, respectively. "dsum2" is a list object with the table stored in its first element, [[1]]. I reference that in the call to the dim() function. The first element dim() returns is the number of rows, the second the number of columns. Using "rn" and "cn" I then index dsum2 to pull out a new table called "ds". This is pretty much what I wanted. The rest of the function is mainly just formatting the data and giving names to the columns.

The next few lines of code break up the "Pl-Yds" column into two separate columns: plays and yards. The lines after that change variable classes and remove the old "Pl-Yds" column. After that I assign names to the columns and drop unused factor levels. Next up I convert TOP (time of possession) from minutes and seconds into total seconds. This allows me to do mathematical operations, such as summing and averaging.

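Here's the TOP conversion on its own, with a few made-up time values, to show what the strsplit()/sapply() combination does:

```r
# Convert "M:S" strings to total seconds (standalone illustration)
top <- c("2:41", "4:47", "0:59")
sapply(strsplit(top, ":"),
       function(x) {
         x <- as.numeric(x) # x[1] is minutes, x[2] is seconds
         x[1]*60 + x[2]
       })
# 161 287  59
```
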
The final chunk of code adds the opponent. This was harder than I thought it would be. I'm sure it can be done faster and easier than I did it, but what I did works. First I use the grep() function to identify the two lines that contain the phrase "Drive Summary". One will always have Virginia Tech and the other the opponent. The next line uses the invert parameter of grep() to pick the line that does not contain Virginia Tech. The selected line looks like this for the first box score of 2012: "<td colspan=\"9\">Georgia Tech Drive Summary</td>". Now I need to extract "Georgia Tech". To do this I split the string by ">" and save the second element:

```r
opp <- strsplit(opp, ">")[[1]]
```

It looks like this after I do the split:

```r
[[1]]
[1] "<td colspan=\"9\""              "Georgia Tech Drive Summary</td"
```

Notice that strsplit() returns a list, with the character vector I want in its first element. Hence the need to add the "[[1]]" reference. I then substitute " Drive Summary</td" with nothing, and that leaves me with "Georgia Tech". Finally I add the season and opponent to the table and update the "dsf" object. The last line in the loop is necessary to allow me to add each game summary to the bottom of the previous table of game summaries.

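Putting the opponent extraction together as a standalone snippet (using the Georgia Tech line quoted above):

```r
# Standalone illustration of pulling the opponent name out of the table header
opp <- "<td colspan=\"9\">Georgia Tech Drive Summary</td>"
opp <- strsplit(opp, ">")[[1]]               # split on ">"; take the character vector
opp <- sub(" Drive Summary</td", "", opp[2]) # strip the trailing text
opp
# "Georgia Tech"
```
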
Here's how I used the function to scrape all VT drive summaries from the 2012 regular season:

```r
dsData2012 <- vtFballData(14871, 14882, 2012)
```

To identify start and stop numbers I had to go to the VT 2012 stats page and hover over all the box score links to figure out the number sequence. Fortunately they go in order. (Thank you VT athletic dept!) The bowl game is out of sequence; its number is 15513. But I could get it by calling vtFballData(15513,15513,2012). After I call the function, which takes about 5 seconds to run, I get a data frame that looks like this:
```
season          opp quarter result top plays yards
  2012 Georgia Tech       1   PUNT 161     6    24
  2012 Georgia Tech       1     TD 287    12    56
  2012 Georgia Tech       1  DOWNS 104     5    -6
  2012 Georgia Tech       2   PUNT 298     7    34
  2012 Georgia Tech       2   PUNT  68     4    10
  2012 Georgia Tech       2   PUNT  42     3     2
```

Now I'm ready to do some analysis! There are plenty of other variables I could have added, such as whether VT won the game, whether it was a home or away game, whether it was a noon, afternoon or night game, etc. But this was good enough as an exercise. Maybe in the future I'll revisit this little function and beef it up.
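
For example, here's a quick taste of the kind of analysis this enables, using just the six Georgia Tech drives shown above:

```r
# Rebuild the six drives shown above as a small data frame
ds <- data.frame(
  season  = 2012,
  opp     = "Georgia Tech",
  quarter = c(1, 1, 1, 2, 2, 2),
  result  = c("PUNT", "TD", "DOWNS", "PUNT", "PUNT", "PUNT"),
  top     = c(161, 287, 104, 298, 68, 42),
  plays   = c(6, 12, 5, 7, 4, 3),
  yards   = c(24, 56, -6, 34, 10, 2)
)
mean(ds$yards)   # average yards per drive: 20
sum(ds$top) / 60 # minutes of possession so far: 16
```

With a full season scraped, the same idea extends to things like average yards per drive by opponent or time of possession per quarter.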