Machine Learning for Hackers, Chapter 1 – statistics you can probably trust

I’ve started working through the book, Machine Learning for Hackers, and decided I would document what I learned or found interesting in each chapter. Chapter 1, titled “Using R”, gives an intro to R in the form of cleaning and prepping UFO data for analysis. I wouldn’t recommend this as a true introduction to R. It’s basically the authors preparing a dataset and stopping after each set of commands to explain what they did. I’ve been an R user for a few years now and I had to work a little to follow everything they did. No problem there. I’m always up for learning something new and long ago realized I will never know all there is to know about R. But I don’t think Chapter 1 of Machine Learning for Hackers is a good place to start if you’re new to R. Anyway, on to what I want to remember.

Number 1

The dataset they work with has two columns of dates in the format YYYYMMDD. They want to convert the dates to actual date types. To do that they use the as.Date function. But before they can do that, they need to identify dates that are not exactly 8 characters long. Here’s how they do it:

good.rows <- ifelse(nchar(ufo$DateOccurred) != 8 | 
                         nchar(ufo$DateReported) != 8, FALSE, TRUE)

This creates a vector of TRUE/FALSE values that is the same length as the original UFO dataset. If either of the two dates has a character length not equal to 8, FALSE is returned. Otherwise TRUE. Next they use the vector of T/F values (called "good.rows") to subset the UFO data:

ufo <- ufo[good.rows,]

This keeps only those records (or rows) that have a corresponding TRUE value in the good.rows vector. So if row 25 in ufo has a TRUE value in "row" 25 of good.rows, that entire row is saved. The new ufo dataset will only have rows where both date values are exactly 8 characters long.
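Here is a small sketch of the whole filter-then-convert step that I put together to check my understanding. It runs on a toy data frame, not the real UFO file: the name ufo.toy and its values are made up, and this is not the book's exact code. The format string "%Y%m%d" is what as.Date needs to read YYYYMMDD text.

```r
# Toy stand-in for the UFO data (name and values invented for illustration)
ufo.toy <- data.frame(
  DateOccurred = c("19951009", "1995100", "19950101"),
  DateReported = c("19951011", "19951011", "0000"),
  stringsAsFactors = FALSE
)

# TRUE only where both dates are exactly 8 characters long
good.rows <- nchar(ufo.toy$DateOccurred) == 8 &
  nchar(ufo.toy$DateReported) == 8

# Keep only the rows flagged TRUE
ufo.toy <- ufo.toy[good.rows, ]

# With the malformed rows gone, convert the text to real Date values
ufo.toy$DateOccurred <- as.Date(ufo.toy$DateOccurred, format = "%Y%m%d")
ufo.toy$DateReported <- as.Date(ufo.toy$DateReported, format = "%Y%m%d")
```

A comparison such as nchar(x) == 8 already returns TRUE/FALSE, so in this sketch I skipped the ifelse wrapper. The result should be the same either way.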

Number 2

They write a function to do some data cleaning. In the function they use the gsub function to remove a leading space from each character string, as follows:

gsub("^ ","",split.location)

The first argument is a regular expression that specifies a leading space: the ^ anchors the match to the start of the string. That's what we're looking for. The second argument is an empty string. That's what we're replacing the leading space with. The final argument is the vector of character values we're searching.
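To see what that pattern does and doesn't catch, here is a quick sketch on a made-up vector (the values are mine, not from the UFO data). The pattern "^ " removes only one leading space, so I've also included two alternatives that handle runs of whitespace; those are my additions, not something the book uses.

```r
# Invented example values: one, two and zero leading spaces
split.location <- c(" Iowa City", "  IA", "Austin")

# The book's pattern: strips a single leading space
one.space <- gsub("^ ", "", split.location)

# Alternative: strip any run of leading whitespace
all.leading <- gsub("^\\s+", "", split.location)

# Alternative: base R helper that trims the left side
trimmed <- trimws(split.location, which = "left")
```

With this input, one.space still has a space in front of "IA", while the other two results do not. For the book's data that probably doesn't matter, but it's worth knowing if your strings are messier.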

Number 3

They want to keep all UFO records from the United States. To do this they want to match the state abbreviation column against a pre-set list of all 50 state abbreviations. If a match is found, the state value stays the same and the record is kept. Otherwise the state value is set to NA, which is eventually used to subset the dataframe for only US incidents. The neat thing I learned here is that R has a built-in vector of state abbreviations called state.abb.
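Here is roughly how that matching can work, as a sketch on a toy vector. The name us.state and its values are invented, and I'm assuming the abbreviations are lower case, so I compare against tolower(state.abb). This is my own reconstruction of the idea, not necessarily the authors' code.

```r
# Toy state column (values invented); "on" and "uk" are not US states
us.state <- c("ia", "tx", "on", "ca", "uk")

# match() gives the position in the lookup vector, or NA when nothing matches
idx <- match(us.state, tolower(state.abb))

# Non-matches become NA; matches keep their original value
us.state[is.na(idx)] <- NA

# The NAs can then be used to keep only the US entries
us.only <- us.state[!is.na(us.state)]
```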

Number 4

At one point they create a sequence of dates using the seq.Date function. Here is how to use it:

date.range <- seq.Date(from=as.Date("some earlier date"), 
                            to=as.Date("some later date"), 
                            by="month")

The by argument also accepts other units, such as "day", "week" and "year".
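A concrete version with real dates filled in, just as an illustration. The dates are arbitrary ones I picked, not the range used in the book.

```r
# Monthly sequence: the first of each month from January to June 1990
date.range <- seq.Date(from = as.Date("1990-01-01"),
                       to   = as.Date("1990-06-01"),
                       by   = "month")

# Year-month labels built from the sequence
year.month <- format(date.range, "%Y-%m")

# Same idea with a yearly step
year.range <- seq.Date(from = as.Date("1990-01-01"),
                       to   = as.Date("1993-01-01"),
                       by   = "year")
```

Here date.range has six dates and year.range has four. Both endpoints are included when they land exactly on a step.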

Number 5

Near the end they have a column of counts for number of sightings per month per year. For example, the number of sightings in Alaska in 1990-01. Sometimes there is a count, other times there is NA. They want to change the NA to 0. Here's how to do it:

all.sightings[,3][is.na(all.sightings[,3])] <- 0
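The same replacement on a toy data frame, so the before and after are easy to inspect. The name all.sightings.toy, its column names and its values are all invented for this sketch.

```r
# Toy version of the counts table (names and values invented)
all.sightings.toy <- data.frame(
  State = c("ak", "ak", "al"),
  YearMonth = c("1990-01", "1990-02", "1990-01"),
  Sightings = c(1, NA, NA),
  stringsAsFactors = FALSE
)

# is.na() flags the missing counts in column 3; those positions are set to 0
all.sightings.toy[, 3][is.na(all.sightings.toy[, 3])] <- 0
```

Referring to the column by name, for example all.sightings.toy$Sightings, would do the same job and is a little easier to read than the column number.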

That about does it. There are other good things in the first chapter, but these are the ones I really thought I could use and didn't want to forget. All the code for chapter 1 (and the other chapters) is available on GitHub. Just beware that the code you can download is different from the code in the book. However, I think the code online is better. For example, the book manually creates a list of all 50 state abbreviations, whereas the code on GitHub uses the state.abb vector I mentioned in Number 3 above.

Clay Ford

4 comments

I actually had a question about this chapter.

I know that with the new version of R and the installation of ggplot2, when you load “library(ggplot2)”, it’ll only load the single library rather than also loading the previously required libraries plyr and reshape. Well, now when I load each individual library

library(ggplot2)
library(plyr)
library(reshape)

and then enter

ufo<-read.delim("data/ufo/ufo_awesome.tsv", sep="\t", stringsAsFactors=FALSE, header=FALSE, na.strings="")

none of the completed data sets pop up.

I mean, I can see the computer taxing the resources for a few seconds, but no results appear afterwards.

ctlr says:

December 11, 2012 at 3:29 am

I’m not sure I understand your question. When you submit the code to create the ufo data set, nothing should pop up. In fact, if it worked, nothing will happen on screen. You just get a fresh prompt waiting for your next command. If you want to see a little of the data you can submit “head(ufo)”, which is what they do on p. 14.


Hi, I can’t find the UFO dataset used in Chapter 1 of Machine Learning for Hackers anywhere: the link to infochimps doesn’t work. Could you help me please? Can you advise me about where to find this dataset? I really need it to practice with R by using that book.
Thank you.

Clay Ford says:

August 16, 2015 at 10:05 am

The data set is on GitHub in the repository for the book’s code: https://github.com/johnmyleswhite/ML_for_Hackers. Drill down into 01-Introduction – data – ufo. The data is “ufo_awesome.tsv”, where tsv means Tab Separated Values. It’s a big file, about 75 MB.
