Tag Archives: Virginia Tech

Scraping Virginia Tech Football Data, Part 2

In an earlier post I described how I went about scraping football data off the Virginia Tech athletics web site. Basically I wrote a function that scraped data one season at a time. It worked, but when I decided to return to this project I realized what a pain it would be to do all seasons from 1987 to 2012. I would have to find the ID number for the first game and last game, then run the function for every season. And then do bowl games because the ID number for those games does not follow the pattern of regular season games. It wouldn’t have been unreasonable to just call the function for every season. Typing it all up probably wouldn’t take more than a few minutes. But it’s inefficient so I decided to modify the program to do all seasons at once. Here’s what I came up with.

First I needed to get all the ID numbers. Here’s my code:

allids <- data.frame(season=numeric(0),ID=character(0)) #will hold all box score ids; 322 games from 1987 - 2012
for (i in 1987:2012){
  url <- paste("http://www.hokiesports.com/football/stats/",i,sep="")
  wp <- readLines(url)  
  box <- grep("showstats\\.html\\?[0-9]+",wp,value=TRUE) # lines containing box score links
  ids <- sub(".*?html\\?([0-9]{4,5}).*", "\\1", box)     # extract just the game ID number
  ids <- data.frame(season=i,ID=ids)
  allids <- rbind(allids,ids)
}

So I go to each season's page, read the web page source code, find each line that contains "showstats.html?xxxx" (where xxxx = ID number), and pull the ID numbers into a vector. That last part requires a fancy pants regular expression which took me a while to figure out. It basically says "match the whole line, but replace it with only the piece captured by the parentheses". This is known as "tagging". The part in the parentheses is the tag: ([0-9]{4,5}). It's referenced in the sub() function as \\1. For more information, see page 99 of Phil Spector's book, Data Manipulation with R. Anyway, I then create a 2-column data frame called "allids" that contains season and ID numbers:

> head(allids)
  season   ID
1   1987 5803
2   1987 5804
3   1987 5805
4   1987 5806
5   1987 5807
6   1987 5808

Now I can use the ID column in my for loop to retrieve drive summaries for every season, like so:

for (i in allids[,2]){
	url <- paste("http://www.hokiesports.com/football/stats/showstats.html?",i,sep="")
	web_page <- readLines(url)

To see the final (for now) version of the code and the CSV file of the data, see my GitHub page for this project. You'll notice I had to build in an error check as it turned out that not all games had drive summaries:

if (length(grep("Virginia Tech Drive Summary", web_page)) == 0) next

That says if grep doesn't find a line with "Virginia Tech Drive Summary" then go to the next iteration in the loop. For some reason the 1994 season only has two games with drive summaries.
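
Pieced together, the top of the loop looks roughly like this (a simplified sketch; the full working version, including the table parsing borrowed from the function in my earlier post below, is on GitHub):

for (i in allids[,2]){
	url <- paste("http://www.hokiesports.com/football/stats/showstats.html?",i,sep="")
	web_page <- readLines(url)
	# skip games whose box score pages have no drive summary (e.g., most of the 1994 games)
	if (length(grep("Virginia Tech Drive Summary", web_page)) == 0) next
	# ...parse the drive summary table and append it to the results...
}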

So now I have a lot of data and I guess that means I should analyze it. I suppose that will be a future blog post.

Scraping Data off a Web Site

I’m taking the Data Analysis class through Coursera and one of the topics we’ve covered so far is how to "scrape" data off a web site. The idea is to programmatically go through the source code of a web page, pull out some data, and then clean it up so you can analyze it. This may seem like overkill at first glance. After all, why not just select the data with your mouse and copy-and-paste it into a spreadsheet? Well, for one, there may be dozens (or hundreds) of pages to visit, and copying-and-pasting from each one would be time-consuming and impractical. Second, rarely does a copy-and-paste off a web site produce data ready for analysis. You have to tidy it up, sometimes quite a bit. Clearly these are both tasks we would like to automate.

To put this idea to use, I decided to scrape some data from the box scores of Virginia Tech football games. I attended Tech and love watching their football team, so this seemed like a fun exercise. Here’s an example of one of their box scores. You’ll see it has everything but what songs the band played during halftime. I decided to start simple and just scrape the Virginia Tech Drive Summaries. This table summarizes each drive, including things like number of plays, number of yards gained, and time of possession. Here’s the function I wrote in R, called vtFballData:

library(XML) # provides readHTMLTable(), used below

vtFballData <- function(start,stop,season){
	dsf <- c() # will hold the drive summaries for all games
	# read the source code
	for (i in start:stop){
	url <- paste("http://www.hokiesports.com/football/stats/showstats.html?",i,sep="")
	web_page <- readLines(url)

	# find where VT drive summary begins
	dsum <- web_page[(grep("Virginia Tech Drive Summary", web_page) - 2):
                         (grep("Virginia Tech Drive Summary", web_page) + 18)]
	dsum2 <- readHTMLTable(dsum)
	rn <- dim(dsum2[[1]])[1]
	cn <- dim(dsum2[[1]])[2]
	ds <- dsum2[[1]][4:rn,c(1,(cn-2):cn)]
	ds[,3] <- as.character(ds[,3]) # convert from factor to character
	py <- do.call(rbind,strsplit(sub("-"," ",ds[,3])," "))
	ds2 <- cbind(ds,py)
	ds2[,5] <- as.character(ds2[,5]) # convert from factor to character
	ds2[,6] <- as.character(ds2[,6]) # convert from factor to character
	ds2[,5] <- as.numeric(ds2[,5]) # convert from character to numeric
	ds2[,6] <- as.numeric(ds2[,6]) # convert from character to numeric
	ds2[,3] <- NULL # drop original pl-yds column

	names(ds2) <-c("quarter","result","top","plays","yards")
	# drop unused factor levels carried over from readHTMLTable
	ds2$quarter <- ds2$quarter[, drop=TRUE] 
	ds2$result <- ds2$result[, drop=TRUE]

	# convert TOP from factor to character
	ds2[,3] <- as.character(ds2[,3]) 
	# convert TOP from M:S to just seconds
	ds2$top <- sapply(strsplit(ds2$top,":"),
  		function(x) {
    		x <- as.numeric(x)
    		x[1]*60 + x[2]})

	# need to add opponent
	opp <- web_page[grep("Drive Summary", web_page)]
	opp <- opp[grep("Virginia Tech", opp, invert=TRUE)] # not VT
	opp <- strsplit(opp,">")[[1]][2]
	opp <- sub(" Drive Summary

I'm sure this is three times longer than it needs to be and could be written much more efficiently, but it works and I understand it. Let's break it down.

My function takes three values: start, stop, and season. Start and stop are both numerical values needed to specify a range of URLs on hokiesports.com. Season is simply the year of the season. I could have scraped that as well but decided to enter it by hand since this function is intended to retrieve all drive summaries for a given season.

The first thing I do in the function is define an empty variable called "dsf" ("drive summaries final") that will ultimately be what my function returns. Next I start a for loop that will start and end at numbers I feed the function via the "start" and "stop" parameters. For example, the box score of the 1st game of the 2012 season has a URL ending in 14871. The box score of the last regular season game ends in 14882. To hit every box score of the 2012 season, I need to cycle through this range of numbers. Each time through the loop I "paste" the number to the end of "http://www.hokiesports.com/football/stats/showstats.html?" and create my URL. I then feed this URL to the readLines() function which retrieves the code of the web page and I save it as "web_page".
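
To make that concrete, here's what one pass through the loop builds, using the ID of the 2012 opener:

> i <- 14871 # box score ID of the first 2012 game
> paste("http://www.hokiesports.com/football/stats/showstats.html?",i,sep="")
[1] "http://www.hokiesports.com/football/stats/showstats.html?14871"

That string is what gets handed to readLines(), which returns a character vector with one element per line of the page's HTML source.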

Let's say we're in the first iteration of our loop and we're doing the 2012 season. We just retrieved the code of the box score web page for the Georgia Tech game. If you go to that page, right click on it and view source, you'll see exactly what we have stored in our "web_page" object. You'll notice it has a lot of stuff we don't need. So the next part of my function zeros in on the Virginia Tech drive summary:

# find where VT drive summary begins
dsum <- web_page[(grep("Virginia Tech Drive Summary", web_page) - 2):
                 (grep("Virginia Tech Drive Summary", web_page) + 18)]

This took some trial and error to assemble. The grep() function tells me which line contains the phrase "Virginia Tech Drive Summary". I subtract 2 from that line to get the line number where the HTML table for the VT drive summary begins (i.e., where the opening <table> tag appears). I need this for the upcoming function. I also add 18 to that line number to get the final line of the table code. I then use this range of line numbers to extract the drive summary table and store it as "dsum". Now I feed "dsum" to the readHTMLTable() function, which converts HTML tables to data frames (returned in a list object), and save the result as "dsum2". The readHTMLTable() function is part of the XML package, so you have to download and install that package first and call library(XML) before running this function.

At this point we have a pretty good looking table. But it has 4 extra rows at the top we need to get rid of. Plus I don't want every column. I only want the first column (quarter) and last three columns (How lost, Pl-Yds, and TOP). This is a personal choice. I suppose I could have snagged every column, but decided to just get a few. To get what I want, I define two new variables, "rn" and "cn". They stand for row number and column number, respectively. "dsum2" is a list object with the table in the first element, [[1]]. I reference that in the call to the dim() function. The first element returned is the number of rows, the second the number of columns. Using "rn" and "cn" I then index dsum2 to pull out a new table called "ds". This is pretty much what I wanted. The rest of the function is mainly just formatting the data and giving names to the columns.
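
If the indexing looks cryptic, here's the same pattern on a little made-up data frame (toy data, not the real drive summary table):

toy <- as.data.frame(matrix(1:40, nrow=8, ncol=5)) # stands in for dsum2[[1]]
rn <- dim(toy)[1] # number of rows (8)
cn <- dim(toy)[2] # number of columns (5)
toy[4:rn, c(1,(cn-2):cn)] # rows 4 through the end; column 1 plus the last three columns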

The next three lines of code serve to break up the "Pl-Yds" column into two separate columns: plays and yards. The following five lines change variable classes and remove the old "Pl-Yds" column. After that I assign names to the columns and drop unused factor levels. Next up I convert TOP into seconds. This allows me to do mathematical operations, such as summing and averaging.
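
Here's what those two conversions do to single sample values (the numbers are just for illustration):

> strsplit(sub("-"," ","12-56")," ")[[1]] # a "Pl-Yds" value becomes plays and yards
[1] "12" "56"
> x <- as.numeric(strsplit("4:47",":")[[1]]) # a TOP value becomes seconds
> x[1]*60 + x[2]
[1] 287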

The final chunk of code adds the opponent. This was harder than I thought it would be. I'm sure it can be done faster and easier than I did it, but what I did works. First I use the grep() function to identify the two lines that contain the phrase "Drive Summary". One will always have Virginia Tech and the other their opponent. The next line uses the invert parameter of grep to pick the line that does not contain Virginia Tech. The selected line looks like this for the first box score of 2012: "<td colspan=\"9\">Georgia Tech Drive Summary</td>". Now I need to extract "Georgia Tech". To do this I split the string by ">" and save the second element:

opp <- strsplit(opp,">")[[1]][2]

It looks like this after I do the split:

[[1]]
[1] "<td colspan=\"9\""              "Georgia Tech Drive Summary</td"

Hence the need to add the "[[1]][2]" reference. Then I substitute " Drive Summary</td" with nothing, which leaves me with "Georgia Tech". Finally I add the season and opponent to the table and update the "dsf" object. That last line is what lets me stack each game's summary underneath the summaries of the games before it.
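
Put together, the opponent extraction looks like this on that sample line:

> opp <- "<td colspan=\"9\">Georgia Tech Drive Summary</td>"
> opp <- strsplit(opp,">")[[1]][2]
> sub(" Drive Summary</td","",opp)
[1] "Georgia Tech"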

Here's how I used the function to scrape all VT drive summaries from the 2012 regular season:

dsData2012 <- vtFballData(14871,14882,2012)

To identify start and stop numbers I had to go to the VT 2012 stats page and hover over all the box score links to figure out the number sequence. Fortunately they go in order. (Thank you VT athletic dept!) The bowl game is out of sequence; its number is 15513. But I could get it by calling vtFballData(15513,15513,2012). After I call the function, which takes about 5 seconds to run, I get a data frame that looks like this:

 season          opp quarter result top plays yards
   2012 Georgia Tech       1   PUNT 161     6    24
   2012 Georgia Tech       1     TD 287    12    56
   2012 Georgia Tech       1  DOWNS 104     5    -6
   2012 Georgia Tech       2   PUNT 298     7    34
   2012 Georgia Tech       2   PUNT  68     4    10
   2012 Georgia Tech       2   PUNT  42     3     2

Now I'm ready to do some analysis! There are plenty of other variables I could have added, such as whether VT won the game, whether it was a home or away game, whether it was a noon, afternoon or night game, etc. But this was good enough as an exercise. Maybe in the future I'll revisit this little function and beef it up.