
Machine Learning for Hackers, Chapter 5

This chapter covers linear regression. The case study is predicting page views for web sites given unique visitors, whether the site advertises, and whether the site is in English. The funny thing is that while they spend a lot of time explaining how to do linear regression in R, they never get around to explaining how to make a prediction using your regression model. (The chapter is actually titled “Regression: Predicting Page Views”.) I’ll explain how to do prediction in this post. But first I want to say the authors do a fine job of explaining the “big idea” of linear regression. It’s not an easy topic to cover in under 30 pages. If you’re new to the subject this is a great introduction. I highly recommend it. (And we all know my recommendation means a great deal.)

So let’s say you follow along with the book and create a linear model to predict page views given unique visitors:

lm.fit <- lm(log(PageViews) ~ log(UniqueVisitors), data = top.1000.sites)

 

Notice the log transformation. This is necessary since the data is not normally distributed. The results of the regression are as follows:

Coefficients:
                   Estimate Std. Error t value   Pr(>|t|) 
(Intercept)        -2.83441    0.75201  -3.769   0.000173 ***
log(UniqueVisitors) 1.33628    0.04568  29.251    < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.084 on 998 degrees of freedom
Multiple R-squared: 0.4616, Adjusted R-squared: 0.4611 
F-statistic: 855.6 on 1 and 998 DF, p-value: < 2.2e-16

 

If we want to predict Page Views for a site with, say, 500,000,000 Unique Visitors, we use the coefficient estimates to calculate the following (keep in mind this gives the predicted log of Page Views, since the model was fit on the log scale):

-2.83441 + 1.33628*log(500000000)

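If you want to do that calculation by hand in R, you can pull the estimates straight out of the model object. A quick sketch using the lm.fit object above:

b <- coef(lm.fit)                        # intercept and slope estimates
log.pred <- b[1] + b[2] * log(500000000) # predicted log(PageViews)
exp(log.pred)                            # back-transform to Page Views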
To easily do this in R we can submit code like this:

new <- data.frame(UniqueVisitors = 500000000)
exp(predict(lm.fit, new, interval = "prediction"))

 

This gives us not only a prediction (fit) but a prediction interval as well (lwr and upr):

          fit        lwr         upr
1 24732846396 2876053856 2.12692e+11

 

Also notice we had to wrap the predict function in the exp function. That's because we ran the regression with the data transformed in the log scale. Once we make our prediction we need to transform it back to the original scale. I'm surprised the authors don't cover this. They do mention the predict function as a way to see what the regression model predicts for the original data values. But they don't talk about how to use your regression results to make a prediction for a new value that wasn't in your data. Perhaps that will be covered in future chapters.

Another minor complaint I have is a statement they make on page 150 in the 2nd paragraph: "...the Residual standard error is simply the RMSE of the model that we could compute using sqrt(mean(residuals(lm.fit) ^ 2))" where lm.fit is the model object created when regressing Page Views on Unique Visitors. That's incorrect. The residual standard error is computed for this model with sqrt(sum(residuals(lm.fit)^2)/998), where 998 = 1000 - 2. In other words, instead of dividing the sum of the squared residuals by 1000 (i.e., taking the mean), we divide by the number of observations minus the number of coefficients in our model. In this case we have 1000 observations and two coefficients in our model (the intercept and slope). So we divide by 998.
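If you want to verify that, here's a quick check in R (a sketch, assuming lm.fit is the model fit above):

rmse <- sqrt(mean(residuals(lm.fit)^2))             # divides by n = 1000
rse  <- sqrt(sum(residuals(lm.fit)^2) / (1000 - 2)) # divides by n - 2 = 998
c(rmse = rmse, rse = rse, sigma = summary(lm.fit)$sigma) # rse and sigma should both be the 1.084 reported above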

While I didn't learn anything new about linear regression in this chapter, I did learn something about plotting lm model objects in R. I've known for a while you can easily create diagnostic plots by submitting the plot function with a model object as the argument. That produces four diagnostic plots (shown one at a time unless you first set up a 2 x 2 grid with par(mfrow = c(2, 2))). What I learned in this chapter is that you can specify which plot is produced by passing an additional argument, "which". For example, to just see a plot of residuals versus fitted values, you submit

plot(lm.fit, which = 1)
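If you do want all four panels on one page, you can set up the grid yourself first (a quick sketch):

par(mfrow = c(2, 2))    # 2 x 2 plotting grid
plot(lm.fit)            # the four default diagnostic plots
par(mfrow = c(1, 1))    # back to one plot per page
plot(lm.fit, which = 2) # a single panel again, here the normal Q-Q plot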

 

The book says issuing this statement will "plot the residuals against the truth." I think "truth" should say "fitted values". I'm not sure I've ever heard "fitted values" referred to as the "truth".

Anyway, as I said, I think this chapter is a solid introduction to linear regression. That's a good thing because it appears the next chapter steps it up a notch.

Machine Learning for Hackers, Chapter 4

This chapter explains how to create a ranking algorithm to prioritize email messages in your inbox. The authors mention Google popularizing the idea of creating a “priority inbox” and point to a paper that describes their strategies. They then proceed to use some of these strategies to create their own ranking program using emails from the SpamAssassin public corpus.

This was a long and unpleasant chapter for me to slog through. For one I don’t think I will ever be asked to design a ranking algorithm for emails. I suppose this concept could be extended to other applications but the chapter doesn’t go there. Another sticking point was that the algorithm the authors build is necessarily incomplete because we’re only dealing with emails received (instead of both emails sent and received). The authors acknowledge this up front and say “the methods and algorithms used in this chapter should be considered as exercises only and not examples of how enterprise priority inbox systems should be implemented.” So I was learning how to implement something sort of like what Google does (but not really) that isn’t actually a traditional machine learning method. Personally that wasn’t huge motivation to pick through the authors’ butt-load of R code and figure out what they were doing. I don’t mean to be critical of the authors. I’m impressed with what they did. I mean, think about it: they wrote a dense 33-page chapter on how to build a ranking program in R as an exercise. That’s dedication. I guess I just couldn’t get rid of that nagging feeling that I would never make use of the material in this chapter in my professional life. So my goal became to study their R code and focus on becoming a better programmer.

When all was said and done, their R code wasn’t too hard to follow. I thought they did a great job of breaking everything down into manageable functions. In fact I would say what they do in this chapter is provide a good model for building a long, complicated R program. Having said that, the portions that jumped out at me as being both unfamiliar and potentially useful were the sections that used regular expressions.

I took a 4-week Coursera class that touched briefly on regular expressions. I also found this tutorial that I thought was useful. Other than that I really don’t know much about them. My impression of regular expressions is that unless you use them on a regular basis you’re not going to master them. So my strategy is to just know what they’re used for and remember that. When the time comes, I’ll have to dig in and figure out how to write them, because there is no way I’m going to teach myself how to use them and have total recall several months (or years) from now.

The first place they use regular expressions is in going through an email message to pull out the lines that contain “From:” in order to eventually strip out an email address:

msg.vec[grepl("From: ", msg.vec)]

The grepl function takes a regular expression as its first argument and returns a logical vector of TRUE/FALSE the same length as its second argument, the vector it combs looking for matches. So here msg.vec is the email message. grepl goes through each line and looks for the text “From: “. If it finds it, it returns TRUE, otherwise FALSE. This is no doubt the easiest example of a regular expression because it’s a literal string. But that first argument could be a really sophisticated regular expression. Finally, grepl is used to index msg.vec such that only lines with “From: ” are returned.
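Here's a toy illustration with a made-up three-line message (not the book's data):

msg.vec <- c("Return-Path: <bounce@example.com>",
             "From: Test Guy <test@guy.com>",
             "Subject: hello")
grepl("From: ", msg.vec)
[1] FALSE  TRUE FALSE
msg.vec[grepl("From: ", msg.vec)]
[1] "From: Test Guy <test@guy.com>"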

Now the lines with “From: ” contain extraneous characters such as angle brackets and of course colons. To address this they use the strsplit function, which splits a character element into a list by a given regular expression. Here’s how they use it:

strsplit(from, '[":<> ]')

where “from” is the character element. Here’s an example of how it works:

test <- "From: Test Guy <test@guy.com>"
strsplit(test, '[":<> ]')
[[1]]
[1] "From"         ""             "Test"         "Guy"          ""
[6] "test@guy.com"

You can see it splits the character element wherever it finds one of the characters in the regular expression. (The square brackets define a character class: match any single character listed inside them.) If I wanted to extract the email address, I could save the strsplit result and then use grepl to find which element contains an "@":

test2 <- strsplit(test, '[":<> ]')
grepl("@", test2[[1]])
[1] FALSE FALSE FALSE FALSE FALSE  TRUE

That shows me which element contains the email address. So let's use that logical vector to pull it out:

test2[[1]][grepl("@", test2[[1]])]
[1] "test@guy.com"

Another function they use is the gsub function which basically does a search and replace based on a regular expression. Here’s an example:

test <- "   20 Dec 2011   "
test
[1] "   20 Dec 2011   "
gsub("^\\s+|\\s+$", "", test)
[1] "20 Dec 2011"

Notice the spaces before and after the date. The gsub function searches for leading or trailing whitespace and replaces it with nothing. And behold that regular expression. Pretty, isn't it? The caret ^ anchors the match at the beginning of the string. The "\s" matches any whitespace character. The extra backslash in front of it is there because backslash is the escape character in R string literals, so you have to type "\\s" in the string to pass the regex "\s" through to the regular expression engine. The + sign means match the previous item one or more times. The pipe "|" means "or": search for leading whitespace OR trailing whitespace. After the pipe is the pattern for trailing whitespace. It's just like the pattern for leading whitespace, except instead of a caret at the front it has a dollar sign at the end, which anchors the match at the end of the string.

Again, there's a massive amount of R programming going on in this chapter and I barely even began to scratch the outside of the wrapping around the surface. Even though I question how practical the actual program is, I do think it's an excellent example of how to organize and tackle a complicated and long project in R.

Simulating the Monty Hall Problem

Back in 1990, the following question was posed to PARADE magazine columnist Marilyn vos Savant (once holder of the Guinness World Record for highest IQ):

Suppose you’re on a game show, and you’re given the choice of three doors. Behind one door is a car, behind the others, goats. You pick a door, say #1, and the host, who knows what’s behind the doors, opens another door, say #3, which has a goat. He says to you, “Do you want to pick door #2?” Is it to your advantage to switch your choice of doors?

Marilyn’s response: “Yes; you should switch. The first door has a 1/3 chance of winning, but the second door has a 2/3 chance.” And thus began a firestorm of controversy. Over a thousand people with PhDs (many of them statisticians and mathematicians) wrote in to her column to tell her she was wrong. Wikipedia has a nice summary of how it went down. And Marilyn’s own site has a nice sample of some of the thousands of letters she received. In the end, though, she was right.

If you want to know why she was right, visit the links above. You won’t believe how many ways there are to analyze this seemingly simple problem in probability. What I want to do in this post is simulate it using R. If what she said was right, I should win 2/3 of the time.

So here’s what we do. First define a variable to store the number of simulations and a vector to store the results of each simulation.

# number of simulations
n <- 10000
# allocate memory to store results
x <- vector(mode = "integer", length = n)

Now we do the simulation. Picking the first door can be thought of as a single binomial trial with p = 1/3. We either pick the car (1) or we don't (0). We have a 2/3 chance of not picking the car. After we pick our door, we know another door will be opened to reveal a goat. That happens every time. That means we'll either switch to a goat or to a car. Hence the "if" statement: if we picked the car (1), switch to a goat (0); else switch to the car (1).

# simulation
for (j in 1:n) {
  # pick a door: either goat = 0 (p = 2/3) or car = 1 (p = 1/3)
  x[j] <- rbinom(1, 1, prob = 1/3)
  # switch: if we had the car (1) we end up with a goat (0), and vice versa
  if (x[j] == 1) {
    x[j] <- 0
  } else {
    x[j] <- 1
  }
}

When the simulation is done, calculate the percentage of times you switched to a car (1):

sum(x)/n # percentage of time you get car

When I ran this I got 0.6667, which is in excellent agreement with the theoretical answer of 2/3. Yay winning cars on game shows!
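As a side note, the loop isn't strictly necessary. The same simulation can be vectorized (a sketch using the same n as above):

first.pick <- rbinom(n, 1, prob = 1/3) # 1 = first pick was the car, 0 = a goat
x <- 1 - first.pick                    # switching wins exactly when the first pick was a goat
sum(x)/n                               # again close to 2/3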

Machine Learning for Hackers, Chapter 3

In this chapter you build a naïve Bayes classifier to classify an email as spam or not spam (or “ham” as they call it) using only the body of the message. The idea is to take a bunch of spam messages and build a Term Document Matrix (TDM). A TDM is a big grid where the row labels are the terms in the spam emails and the column headers are the spam emails. If row 21 is “viagra” and column 20 is the 20th spam email, then cell [21, 20] holds the number of times the word “viagra” appeared in that email. Using the TDM they create a data frame which contains the terms and the percentage of documents in which a given term occurs. For example, of the 500 spam emails, the term “html” appears in 33% of the messages. That gets treated as a conditional probability: the probability of seeing the term “html” in an email given the email is spam is 0.33. The same process is carried out for emails that are not spam. So we have two data frames of terms (one for spam and one for not spam) and their estimated probabilities of occurrence.
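The authors use the tm package to build the TDM. Stripped down to its core, the construction looks roughly like this (a sketch, not their exact code; spam.texts stands in for a character vector of spam message bodies):

library(tm)
corpus <- Corpus(VectorSource(spam.texts))
tdm <- TermDocumentMatrix(corpus,
                          control = list(stopwords = TRUE, removePunctuation = TRUE,
                                         removeNumbers = TRUE))
m <- as.matrix(tdm)
# fraction of spam documents in which each term appears at least once
occurrence <- rowSums(m > 0) / ncol(m)
head(sort(occurrence, decreasing = TRUE))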

To classify a new email as either spam or not spam we compare it to both data frames and see which terms in the email are in the respective data frame. If a term is in the data frame, we pull out its probability. If not, we assign it a low probability of 0.000001. We multiply the probabilities together along with the prior probability of the message being spam, which they set as 0.5 for starters.

For example, say we have a message with 5 words in the body. None of the words in the email appear in the spam data frame, so we calculate \( 0.5*0.000001^{5}\). However say two of the words appear in the good email data frame, each with probability of 0.05 of occurring. So we calculate \( 0.5*0.05^{2}*0.000001^{3}\). Since the latter calculation is greater than the former, we classify the email as not spam. That’s how you do naïve Bayes classification.
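In R, that toy comparison is just:

prior   <- 0.5
unknown <- 0.000001
spam.score <- prior * unknown^5           # none of the 5 words were in the spam data frame
ham.score  <- prior * 0.05^2 * unknown^3  # two words were in the ham data frame at p = 0.05
ham.score > spam.score                    # TRUE, so classify as not spam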

Now if you’re like me you’re probably thinking, where’s the denominator? I mean, I’m no Bayesian expert, but I do know Bayes’s theorem has a denominator. Well it turns out the denominator is the same in both calculations so they drop it. (I have to give a shout out to this guy’s explanation. That’s where I went to figure it out.) You’ll also notice we’re multiplying probabilities as if they’re independent. That’s the naïve assumption here.

Using the language of probability, here’s Bayes’s Theorem for two terms, where A = message is spam, B = some word, and C = some other word:

$$ P(A|B \cap C ) = \frac{ P(B|A \cap C ) P(C|A) P(A) }{ P(B|C) P(C) }$$

For not spam this is…

$$ P(A^{\prime} | B \cap C) = \frac{P(B|A^{\prime} \cap C)P(C|A^{\prime})P(A^{\prime})}{P(B|C)P(C)}$$

Notice the denominators are the same so we can drop them when comparing the two. Since we’re also assuming independence, the numerator changes to \( P(B|A)P(C|A)P(A)\) and \(P(B|A^{\prime})P(C|A^{\prime})P(A^{\prime})\), respectively.

The book does not explain the classification algorithm in the detail I just did. I guess since the book is “for hackers” they didn’t want to get bogged down in math formulas and explanations of Bayes’s Theorem. I get that. But still I wanted to understand why it worked and not just how to do it in R.

The R code itself for this chapter is not too hard to understand. A few nuggets I got out of it (with quick examples after the list) are…

The readLines() function will return each line of text as a separate element of a character vector.
The dir() function will return a listing of all file names in a directory.
The intersect() function will return the values in common between two vectors.
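The file and directory names here are made up, but the calls look like this:

readLines("message.txt")                        # one element per line of the file
dir("spam/")                                    # character vector of file names in spam/
intersect(c("a", "b", "c"), c("b", "c", "d"))   # returns "b" "c"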

Machine Learning for Hackers, Chapter 2

Chapter 2 of Machine Learning for Hackers is called Data Exploration. It explains means, medians, quartiles, variance, histograms, scatterplots, things like that. It’s a quick and effective introduction to some basic statistics. If you know stats pretty well you can probably skip it, but I found it pretty useful for its intro to the ggplot2 package.

The data they explore is a collection of heights and weights for males and females. You can download it here. The authors explore the data by creating histograms and scatterplots.

Here’s how they create a histogram of the heights:

ggplot(ch2, aes(x = Height)) + geom_histogram(binwidth=1)


Changing the binwidth parameter changes the size of the bins in inches. They also create kernel density estimates (KDE):

ggplot(ch2, aes(x = Height)) + geom_density()


Notice that line of code is the same as the histogram code but with a different function called after the “+”. That plots all the height data at once. They then show you how to create a separate KDE for each gender:

ggplot(ch2, aes(x = Height, fill=Gender)) + geom_density() + facet_grid(Gender ~ .)


Histograms are good for one variable. If you want to explore two numeric variables, you bust out a scatterplot. Continuing with ggplot, they do the following:

ggplot(ch2, aes(x = Height, y = Weight)) + geom_point()


There’s your scatterplot. Next they show how to easily run a smooth prediction line through the plot with a confidence band:

ggplot(ch2, aes(x = Height, y = Weight)) + geom_point() + geom_smooth()


Then they show you again how to distinguish between gender:

ggplot(ch2, aes(x = Height, y = Weight, color = Gender)) + geom_point()


Using this last scatterplot with genders called out, they decide to give a sneak preview of machine learning using a simple classification model. First they code the genders as Male=1 and Female=0, like so:

ch2 <- transform(ch2, Male = ifelse(Gender == 'Male', 1, 0))

Then they use glm to create a logit model that attempts to predict gender based on height and weight:

logit.model <- glm(Male ~ Height + Weight, data=ch2, family=binomial)

Finally they redraw the scatterplot, but this time use the logit model parameters to draw a "separating hyperplane" through the scatterplot:

ggplot(ch2, aes(x = Height, y = Weight, color = Gender)) + geom_point() +
  stat_abline(intercept = -coef(logit.model)[1] / coef(logit.model)[3],
              slope = -coef(logit.model)[2] / coef(logit.model)[3],
              geom = 'abline', color = 'black')


The formula for the line is \( y = -\frac{\alpha}{\beta_{2}} - \frac{\beta_{1}}{\beta_{2}}x\), where x is Height and y is Weight. They don't tell you that in the book, but I thought I would state that for the record. Also the code in the book for this portion has typos and draws something different. What I provide above replicates Figure 2-31 in the book.
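If you want to see where that comes from: the separating line is where the model's log-odds (the linear predictor) equal zero. With \( \alpha \) the intercept, \( \beta_{1} \) the Height coefficient, and \( \beta_{2} \) the Weight coefficient, solving for Weight gives

$$ \alpha + \beta_{1}x + \beta_{2}y = 0 \quad \Longrightarrow \quad y = -\frac{\alpha}{\beta_{2}} - \frac{\beta_{1}}{\beta_{2}}x $$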

Again I felt like this was a solid intro to the ggplot package. And of course it's never bad to review the basics. "You have to be sound in the fundamentals" as sports analysts are fond of saying.

Machine Learning for Hackers, Chapter 1

I’ve started working through the book, Machine Learning for Hackers, and decided I would document what I learned or found interesting in each chapter. Chapter 1, titled “Using R”, gives an intro to R in the form of cleaning and prepping UFO data for analysis. I wouldn’t recommend this as a true introduction to R. It’s basically the authors preparing a dataset and stopping after each set of commands to explain what they did. I’ve been an R user for a few years now and I had to work a little to follow everything they did. No problem there. I’m always up for learning something new and long ago realized I will never know all there is to know about R. But I don’t think Chapter 1 of Machine Learning for Hackers is a good place to start if you’re new to R. Anyway, on to what I want to remember.

Number 1

The dataset they work with has two columns of dates in the format YYYYMMDD. They want to convert the dates to actual date types. To do that, they use the as.Date function. But before they can do that, they need to identify dates that are not exactly 8 characters long. Here’s how they do it:

good.rows <- ifelse(nchar(ufo$DateOccurred) != 8 | 
                         nchar(ufo$DateReported) != 8, FALSE, TRUE)

This creates a vector of TRUE/FALSE values that is the same length as the original UFO dataset. If either of the two dates have character length not equal to 8, FALSE is returned. Otherwise TRUE. Next they use the vector of T/F values (called "good.rows") to subset the UFO data:

ufo <- ufo[good.rows,]

This keeps only those records (or rows) that have a corresponding TRUE value in the good.rows vector. So if row 25 in ufo has a TRUE value in "row" 25 of good.rows, that entire row is saved. The new ufo dataset will only have rows where both date values are exactly 8 characters long.
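With the bad rows removed, the conversion that motivated the check looks something like this (the format string is my assumption based on the YYYYMMDD description):

ufo$DateOccurred <- as.Date(ufo$DateOccurred, format = "%Y%m%d")
ufo$DateReported <- as.Date(ufo$DateReported, format = "%Y%m%d")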

Number 2

They write a function to do some data cleaning. In the function they use the gsub function to remove leading white spaces from each character, as follows:

gsub("^ ","",split.location)

The first argument specifies a leading space. That's what we're looking for. The second argument is nothing. That's what we're replacing the leading space with. The final argument is the vector of character values we're searching.
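Here's a toy example of what that accomplishes (the values are made up, mimicking a city/state split that leaves a leading space):

split.location <- c("Iowa City", " IA")
gsub("^ ", "", split.location)
[1] "Iowa City" "IA"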

Number 3

They want to keep all UFO records from the United States. To do this they want to match the state abbreviation column against a pre-set list of all 50 state abbreviations. If a match is found, the state value stays the same and the record is kept. Otherwise the state value is set to NA, which is eventually used to subset the dataframe for only US incidents. The neat thing I learned here is that R has a built-in vector of state abbreviations called state.abb.
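As a quick illustration of the idea (with made-up state strings, not the book's exact code):

state.vec <- c("va", "md", "zz")
state.abb[match(toupper(state.vec), state.abb)]
[1] "VA" "MD" NA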

Number 4

At one point they create a sequence of dates using the seq.Date function. Here is how to use it:

date.range <- seq.Date(from=as.Date("some earlier date"), 
                            to=as.Date("some later date"), 
                            by="month")

The by argument can also be years and days.
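A quick example with arbitrary dates:

seq.Date(from = as.Date("2010-01-01"), to = as.Date("2010-04-01"), by = "month")
[1] "2010-01-01" "2010-02-01" "2010-03-01" "2010-04-01"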

Number 5

Near the end they have a column of counts for number of sightings per month per year. For example, the number of sightings in Alaska in 1990-01. Sometimes there is a count, other times there is NA. They want to change the NA to 0. Here's how to do it:

all.sightings[,3][is.na(all.sightings[,3])] <- 0
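Here's that line in action on a tiny made-up data frame:

all.sightings <- data.frame(State = c("AK", "AK"),
                            YearMonth = c("1990-01", "1990-02"),
                            Sightings = c(3, NA))
all.sightings[,3][is.na(all.sightings[,3])] <- 0
all.sightings[,3]
[1] 3 0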

That about does it. There are other good things in the first chapter, but these are the ones I really thought I could use and didn't want to forget. All the code for chapter 1 (and the other chapters) is available on github. Just beware the code you can download is different from the code in the book. However I think the code online is better. For example the book manually creates a list of all 50 state abbreviations, whereas the code on github uses the state.abb vector I mentioned in Number 3 above.