Machine Learning for Hackers, Chapter 2

Chapter 2 of Machine Learning for Hackers is called Data Exploration. It explains means, medians, quartiles, variance, histograms, scatterplots, things like that. It’s a quick and effective introduction to some basic statistics. If you know stats pretty well you can probably skip it, but I found it pretty useful for its intro to the ggplot2 package.

The data they explore is a collection of heights and weights for males and females. You can download it here. The authors explore the data by creating histograms and scatterplots.

Here’s how they create a histogram of the heights:

ggplot(ch2, aes(x = Height)) + geom_histogram(binwidth=1)

Changing the binwidth parameter changes the size of the bins in inches. They also create kernel density estimates (KDE):

ggplot(ch2, aes(x = Height)) + geom_density()

Notice that this line of code is the same as the histogram code but with a different function called after the “+”. That plots all the height data. They then show you how to create a separate KDE for each gender:

ggplot(ch2, aes(x = Height, fill=Gender)) + geom_density() + facet_grid(Gender ~ .)

Histograms are good for one variable. If you want to explore two numeric variables, you bust out a scatterplot. Continuing with ggplot, they do the following:

ggplot(ch2, aes(x = Height, y = Weight)) + geom_point()

There’s your scatterplot. Next they show how to easily run a smooth prediction line through the plot with a confidence band:

ggplot(ch2, aes(x = Height, y = Weight)) + geom_point() + geom_smooth()

Then they show you again how to distinguish between genders:

ggplot(ch2, aes(x = Height, y = Weight, color = Gender)) + geom_point()

Using this last scatterplot with genders called out, they decide to give a sneak preview of machine learning using a simple classification model. First they code the genders as Male=1 and Female=0, like so:

ch2 <- transform(ch2, Male = ifelse(Gender == 'Male', 1, 0))

Then they use glm to create a logit model that attempts to predict gender based on height and weight:

logit.model <- glm(Male ~ Height + Weight, data=ch2, family=binomial)
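If you don't have the book's data handy, here's a minimal sketch of the same two steps on synthetic data. Everything about the toy data frame (its name, the means, the noise levels) is made up for illustration; only the transform and glm calls mirror the text. The accuracy check and the 0.5 cutoff are my own additions, not something taken from the book.

```r
# Synthetic stand-in for the heights/weights data (all values are invented).
set.seed(42)
n <- 200
toy <- data.frame(
  Gender = rep(c("Male", "Female"), each = n),
  Height = c(rnorm(n, mean = 69, sd = 3), rnorm(n, mean = 64, sd = 3))
)
# Weight rises with height, plus a gender offset and some noise.
toy$Weight <- -220 + 5.5 * toy$Height +
  ifelse(toy$Gender == "Male", 20, 0) + rnorm(2 * n, sd = 15)

# Same coding and model form as in the text.
toy <- transform(toy, Male = ifelse(Gender == "Male", 1, 0))
toy.model <- glm(Male ~ Height + Weight, data = toy, family = binomial)

# Classify as male when the fitted probability exceeds 0.5.
toy$Predicted <- ifelse(predict(toy.model, type = "response") > 0.5, 1, 0)
accuracy <- mean(toy$Predicted == toy$Male)

# Intercept and slope of the 0.5-probability line in the Height/Weight plane.
b <- coef(toy.model)
boundary <- c(intercept = -b[[1]] / b[[3]], slope = -b[[2]] / b[[3]])
```

The boundary vector holds the same two quantities that get passed as intercept and slope when the line is drawn over the scatterplot.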

Finally they redraw the scatterplot, but this time use the logit model parameters to draw a "separating hyperplane" through the scatterplot:

ggplot(ch2, aes(x = Height, y = Weight, color = Gender)) + geom_point() +
  # geom_abline() is used here because stat_abline() was removed in ggplot2 2.0.0
  geom_abline(intercept = - coef(logit.model)[1] / coef(logit.model)[3],
              slope = - coef(logit.model)[2] / coef(logit.model)[3],
              color = 'black')

The formula for the line is $y = - \frac{\alpha}{\beta_{2}} - \frac{\beta_{1}}{\beta_{2}} x$, with $x$ as height and $y$ as weight. They don't tell you that in the book, but I thought I would state that for the record. Also, the code in the book for this portion has typos and draws something different. What I provide above replicates Figure 2-31 in the book.
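For what it's worth, here is a short sketch of where that line comes from. I'm assuming the usual logistic model, $P(\text{Male}) = 1 / (1 + e^{-(\alpha + \beta_{1} x + \beta_{2} y)})$, where $\alpha$, $\beta_{1}$ and $\beta_{2}$ are the first, second and third entries of coef(logit.model), $x$ is height and $y$ is weight. The predicted probability is exactly 0.5 when the exponent is zero, and solving that for $y$ gives the line:

```latex
\begin{aligned}
P(\text{Male}) = 0.5
  &\iff \alpha + \beta_{1} x + \beta_{2} y = 0 \\
  &\iff y = -\frac{\alpha}{\beta_{2}} - \frac{\beta_{1}}{\beta_{2}}\, x
\end{aligned}
```

The two fractions are the intercept and slope arguments in the plotting code.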

Again I felt like this was a solid intro to the ggplot package. And of course it's never bad to review the basics. "You have to be sound in the fundamentals" as sports analysts are fond of saying.

Machine Learning for Hackers, Chapter 1

I’ve started working through the book Machine Learning for Hackers and decided I would document what I learned or found interesting in each chapter. Chapter 1, titled “Using R”, gives an intro to R in the form of cleaning and prepping UFO data for analysis. I wouldn’t recommend this as a true introduction to R. It’s basically the authors preparing a dataset and stopping after each set of commands to explain what they did. I’ve been an R user for a few years now and I had to work a little to follow everything they did. No problem there. I’m always up for learning something new and long ago realized I will never know all there is to know about R. But I don’t think Chapter 1 of Machine Learning for Hackers is a good place to start if you’re new to R. Anyway, on to what I want to remember.

Number 1

The dataset they work with has two columns of dates in the format YYYYMMDD. They want to convert the dates to actual date types. To do that they use the as.Date function, but first they need to identify dates that are shorter or longer than 8 characters. Here’s how they do it:

good.rows <- ifelse(nchar(ufo$DateOccurred) != 8 | nchar(ufo$DateReported) != 8, FALSE, TRUE)

This creates a vector of TRUE/FALSE values that is the same length as the original UFO dataset. If either of the two dates has a character length not equal to 8, FALSE is returned. Otherwise TRUE. Next they use the vector of T/F values (called "good.rows") to subset the UFO data:

ufo <- ufo[good.rows,]

This keeps only those records (or rows) that have a corresponding TRUE value in the good.rows vector. So if row 25 in ufo has a TRUE value in "row" 25 of good.rows, that entire row is saved. The new ufo dataset will only have rows where both date values are exactly 8 characters long.
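Here's a small sketch of the whole filter-then-convert step on a toy data frame. The column names match the ones above, but the three rows are invented, and ufo.toy is just a name I picked. I've also written the test as a direct logical comparison, which should give the same vector as the ifelse version, and the "%Y%m%d" format string is my assumption for how YYYYMMDD maps to as.Date.

```r
# Toy stand-in for the UFO data (values are invented).
ufo.toy <- data.frame(
  DateOccurred = c("19951009", "0000", "20100615"),
  DateReported = c("19951011", "19970101", "201006150"),
  stringsAsFactors = FALSE
)

# TRUE only when both dates are exactly 8 characters long.
good.rows <- nchar(ufo.toy$DateOccurred) == 8 & nchar(ufo.toy$DateReported) == 8
ufo.toy <- ufo.toy[good.rows, ]

# With the malformed rows gone, the strings convert to Date objects.
ufo.toy$DateOccurred <- as.Date(ufo.toy$DateOccurred, format = "%Y%m%d")
ufo.toy$DateReported <- as.Date(ufo.toy$DateReported, format = "%Y%m%d")
```

Only the first row survives here: the second has a 4-character DateOccurred and the third has a 9-character DateReported.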

Number 2

They write a function to do some data cleaning. In the function they use the gsub function to remove a leading space from each character string, as follows:

gsub("^ ","",split.location)

The first argument specifies a leading space. That's what we're looking for. The second argument is nothing. That's what we're replacing the leading space with. The final argument is the vector of character values we're searching.
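One thing worth seeing for yourself is that the pattern "^ " matches a single leading space only. Here's a quick sketch with made-up strings. The "^\\s+" pattern and the trimws call are alternatives I'm adding for comparison, not what the book uses.

```r
# Invented strings with zero, one, and three leading spaces.
split.location <- c("Iowa City", " IA", "   Austin")

# The pattern from the text: strips exactly one leading space.
one.space <- gsub("^ ", "", split.location)

# One or more leading whitespace characters.
all.space <- gsub("^\\s+", "", split.location)

# Base R helper that does the same thing as the line above.
trimmed <- trimws(split.location, which = "left")
```

With these inputs, one.space still has two spaces in front of "Austin", while all.space and trimmed have none.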

Number 3

They want to keep all UFO records from the United States. To do this they want to match the state abbreviation column against a pre-set list of all 50 state abbreviations. If a match is found, the state value stays the same and the record is kept. Otherwise the state value is set to NA, which is eventually used to subset the dataframe for only US incidents. The neat thing I learned here is that R has a built-in vector of state abbreviations called state.abb.
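A minimal sketch of that matching idea, using match against state.abb. The us.state vector and its values are invented, and I'm assuming upper-case abbreviations for simplicity. The real data may need a toupper or tolower first.

```r
# Invented state column: two US abbreviations, one Canadian province, one blank.
us.state <- c("IA", "TX", "ON", "")

# match() gives the position in state.abb, or NA when there is no match.
idx <- match(us.state, state.abb)

# Indexing with NA yields NA, so non-US values become NA.
us.state <- state.abb[idx]

# Keep only the US entries.
us.only <- us.state[!is.na(us.state)]
```

On a data frame, the same !is.na test would go in the row position of the subset.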

Number 4

At one point they create a sequence of dates using the seq.Date function. Here is how to use it:

date.range <- seq.Date(from=as.Date("some earlier date"),
to=as.Date("some later date"), by="month")

The by argument also accepts "day", "week", "quarter", and "year", optionally with a count in front, as in "2 weeks".
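Here's the same call with concrete dates filled in, as a quick sketch. The dates themselves are arbitrary ones I picked. The format call at the end is my addition and shows one way to turn the sequence into year-month labels.

```r
# First of each month, January through June 1990.
date.range <- seq.Date(from = as.Date("1990-01-01"),
                       to = as.Date("1990-06-01"), by = "month")

# Year-month labels such as "1990-01".
date.strings <- format(date.range, "%Y-%m")

# A count in front of the unit also works: every two weeks.
biweekly <- seq.Date(from = as.Date("1990-01-01"),
                     to = as.Date("1990-02-01"), by = "2 weeks")
```

The sequence never runs past the to date, so biweekly stops at the last two-week step on or before February 1.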

Number 5

Near the end they have a column of counts for number of sightings per month per year. For example, the number of sightings in Alaska in 1990-01. Sometimes there is a count, other times there is NA. They want to change the NA to 0. Here's how to do it:

all.sightings[,3][is.na(all.sightings[,3])] <- 0
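To see it work, here's a sketch on a tiny invented data frame. The column names are hypothetical. All that matters is that the counts sit in the third column.

```r
# Invented counts; NA means no sightings were recorded for that state and month.
all.sightings <- data.frame(
  State = c("ak", "ak", "al"),
  YearMonth = c("1990-01", "1990-02", "1990-01"),
  Sightings = c(1, NA, NA)
)

# is.na() picks out the missing entries of column 3; the assignment overwrites only those.
all.sightings[, 3][is.na(all.sightings[, 3])] <- 0
```

The existing count of 1 is left alone, and the two NA values become 0.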

That about does it. There are other good things in the first chapter, but these are the ones I really thought I could use and didn't want to forget. All the code for chapter 1 (and the other chapters) is available on GitHub. Just beware that the code you can download is different from the code in the book. However, I think the code online is better. For example, the book manually creates a list of all 50 state abbreviations, whereas the code on GitHub uses the state.abb vector I mentioned in Number 3 above.