Scraping Virginia Tech Football Data, Part 2

In an earlier post I described how I went about scraping football data off the Virginia Tech athletics web site. Basically I wrote a function that scraped data one season at a time. It worked, but when I decided to return to this project I realized what a pain it would be to do all seasons from 1987 to 2012. I would have to find the ID number for the first game and last game, then run the function for every season. And then do bowl games because the ID number for those games does not follow the pattern of regular season games. It wouldn’t have been unreasonable to just call the function for every season. Typing it all up probably wouldn’t take more than a few minutes. But it’s inefficient so I decided to modify the program to do all seasons at once. Here’s what I came up with.

First I needed to get all the ID numbers. Here’s my code:

allids <- data.frame(season=numeric(0),ID=character(0)) #will hold all box score ids; 322 games from 1987 - 2012
for (i in 1987:2012){
  url <- paste("",i,sep="")
  wp <- readLines(url)  
  box <- grep("showstats\\.html\\?[0-9]+",wp,value=TRUE) # split the element at "?"
  ids <- sub(".*?html\\?([0-9]{4,5}).*", "\\1", box)
  ids <- data.frame(season=i,ID=ids)
  allids <- rbind(allids,ids)

So I go to each season's page, read the web page source code, find each line that contains "showstats.html?xxxx" (where xxxx = ID number), and pull the ID number into a vector. That last part requires a fancy pants regular expression which took me a while to figure out. It basically says "find everything per the rule in the double quotes, and substitute it with the sub-expression in the parentheses". This is known as "tagging". The part in the parentheses is the tag: ([0-9]{4,5}). It's represented in the sub function with \\1. For more information, see page 99 of Phil Spector's book, Data Manipulation with R. Anyway, I then create a 2 column data frame called "allids" that contains season and ID numbers:

> head(allids)
  season   ID
1   1987 5803
2   1987 5804
3   1987 5805
4   1987 5806
5   1987 5807
6   1987 5808

Now I can use the ID column in my for loop to retrieve drive summaries for every season, like so.

for (i in allids[,2]){
	url <- paste("",i,sep="")
	web_page <- readLines(url)

To see the final (for now) version of the code and the CSV file of the data, see my GitHub page for this project. You'll notice I had to build in an error check as it turned out that not all games had drive summaries:

if (length(grep("Virginia Tech Drive Summary", web_page)) == 0) next

That says if grep doesn't find a line with "Virginia Tech Drive Summary" then go to the next iteration in the loop. For some reason the 1994 season only has two games with drive summaries.

So now I have a lot of data and I guess that means I should analyze it. I suppose that will be a future blog post.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.