I’ve started working through the book Machine Learning for Hackers, and decided I would document what I learned or found interesting in each chapter. Chapter 1, titled “Using R”, gives an intro to R in the form of cleaning and prepping UFO data for analysis. I wouldn’t recommend this as a true introduction to R. It’s basically the authors preparing a dataset and stopping after each set of commands to explain what they did. I’ve been an R user for a few years now and I had to work a little to follow everything they did. No problem there. I’m always up for learning something new, and I long ago realized I will never know all there is to know about R. But I don’t think Chapter 1 of Machine Learning for Hackers is a good place to start if you’re new to R. Anyway, on to what I want to remember.
The dataset they work with has two columns of dates in the format YYYYMMDD. They want to convert these to actual date types using the as.Date function. But before they can do that, they need to identify dates that are not exactly 8 characters long. Here’s how they do it:
good.rows <- ifelse(nchar(ufo$DateOccurred) != 8 | nchar(ufo$DateReported) != 8, FALSE, TRUE)
This creates a vector of TRUE/FALSE values that is the same length as the original UFO dataset. If either of the two dates has a character length not equal to 8, FALSE is returned. Otherwise TRUE is returned. Next they use the vector of TRUE/FALSE values (called "good.rows") to subset the UFO data:
ufo <- ufo[good.rows,]
This keeps only those records (or rows) that have a corresponding TRUE value in the good.rows vector. So if row 25 in ufo has a TRUE value in "row" 25 of good.rows, that entire row is saved. The new ufo dataset will only have rows where both date values are exactly 8 characters long.
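To see the filter and the later as.Date conversion together, here is a minimal sketch on a toy data frame. The three rows are made up for illustration, and the format string "%Y%m%d" is my reading of the YYYYMMDD layout described above. Since the comparison already yields TRUE/FALSE, the sketch skips ifelse and builds the vector directly, which should give the same result.

```r
# Toy stand-in for the UFO data; these rows are invented for illustration.
ufo <- data.frame(
  DateOccurred = c("19951009", "0000", "19950101"),
  DateReported = c("19951009", "19951011", "1995010"),
  stringsAsFactors = FALSE
)

# TRUE only where both date strings are exactly 8 characters long.
good.rows <- nchar(ufo$DateOccurred) == 8 & nchar(ufo$DateReported) == 8

# Keep the rows flagged TRUE.
ufo <- ufo[good.rows, ]

# Convert the surviving YYYYMMDD strings to Date objects.
ufo$DateOccurred <- as.Date(ufo$DateOccurred, format = "%Y%m%d")
ufo$DateReported <- as.Date(ufo$DateReported, format = "%Y%m%d")
```

Only the first toy row survives, because the other two each contain a date string that is too short.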
They write a function to do some data cleaning. In the function they use the gsub function to remove a leading space from each character string. The call takes three arguments.
The first argument is a pattern that specifies a leading space. That's what we're looking for. The second argument is nothing (an empty string). That's what we're replacing the leading space with. The final argument is the vector of character values we're searching.
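Here is a small sketch of that kind of gsub call. The pattern "^ " and the input vector are my own assumptions based on the description above, not copied from the book, and the variable names are hypothetical.

```r
# Hypothetical input: pieces of a location string, some with a leading space.
split.location <- c(" Iowa City", " IA", "no leading space")

# "^ " matches a single space at the very start of each string;
# "" replaces it with nothing.
clean.location <- gsub("^ ", "", split.location)
```

Note that this pattern removes just one leading space. If the data might have several, a pattern like "^\\s+" or the base function trimws(x, which = "left") would be worth a look.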
They want to keep all UFO records from the United States. To do this they want to match the state abbreviation column against a pre-set list of all 50 state abbreviations. If a match is found, the state value stays the same and the record is kept. Otherwise the state value is set to NA, which is eventually used to subset the dataframe for only US incidents. The neat thing I learned here is that R has a built-in vector of state abbreviations called state.abb.
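A rough sketch of the matching idea, using the built-in state.abb vector. The input values are made up, and I'm assuming upper-case abbreviations here; the book's data may use a different case, in which case toupper or tolower would be needed on one side.

```r
# Made-up state column; "ON" (Ontario) and "" are not US states.
us.state <- c("IA", "WA", "ON", "TX", "")

# match() returns the position in state.abb, or NA when there is no match.
idx <- match(us.state, state.abb)

# Blank out everything that did not match a US state.
us.state[is.na(idx)] <- NA

# Keep only the US entries.
us.only <- us.state[!is.na(us.state)]
```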
At one point they create a sequence of dates using the seq.Date function. Here is how to use it:
date.range <- seq.Date(from=as.Date("some earlier date"), to=as.Date("some later date"), by="month")
The by argument also accepts "day", "week", "quarter" and "year".
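For a concrete version, here is a sketch with real dates filled in. The start and end dates are arbitrary picks of mine, and the format step is just one way to turn the result into year-month labels like 1990-01.

```r
# Monthly sequence from January through June 1990 (dates chosen arbitrarily).
date.range <- seq.Date(from = as.Date("1990-01-01"),
                       to   = as.Date("1990-06-01"),
                       by   = "month")

# Turn each date into a "YYYY-MM" label.
date.strings <- format(date.range, "%Y-%m")
```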
Near the end they have a column of counts for number of sightings per month per year. For example, the number of sightings in Alaska in 1990-01. Sometimes there is a count, other times there is NA. They want to change the NA to 0. Here's how to do it:
all.sightings[,3][is.na(all.sightings[,3])] <- 0
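Here is the same replacement on a toy data frame so you can see it work. The column names and values are placeholders I made up; only the third column holding the counts matters. The second version does the same thing by column name, which I find a bit easier to read.

```r
# Toy data: counts in the third column, with some NAs (values are invented).
all.sightings <- data.frame(
  State     = c("ak", "ak", "al"),
  YearMonth = c("1990-01", "1990-02", "1990-01"),
  Sightings = c(1, NA, NA),
  stringsAsFactors = FALSE
)

# Keep an untouched copy to try the by-name version on.
by.name <- all.sightings

# By position: is.na() picks out the NA entries of column 3, which get 0.
all.sightings[, 3][is.na(all.sightings[, 3])] <- 0

# By name: same idea, addressing the column as $Sightings.
by.name$Sightings[is.na(by.name$Sightings)] <- 0
```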
That about does it. There are other good things in the first chapter, but these are the ones I really thought I could use and didn't want to forget. All the code for chapter 1 (and the other chapters) is available on GitHub. Just beware that the code you can download is different from the code in the book. However, I think the code online is better. For example, the book manually creates a list of all 50 state abbreviations, whereas the code on GitHub uses the state.abb vector I mentioned above.