Machine Learning for Hackers, Chapter 4

This chapter explains how to create a ranking algorithm to prioritize email messages in your inbox. The authors mention Google popularizing the idea of creating a “priority inbox” and point to a paper that describes their strategies. They then proceed to use some of these strategies to create their own ranking program using emails from the SpamAssassin public corpus.

This was a long and unpleasant chapter for me to slog through. For one thing, I don’t think I will ever be asked to design a ranking algorithm for emails. I suppose this concept could be extended to other applications, but the chapter doesn’t go there. Another sticking point was that the algorithm the authors build is necessarily incomplete because we’re only dealing with emails received (instead of both emails sent and received). The authors acknowledge this up front and say “the methods and algorithms used in this chapter should be considered as exercises only and not examples of how enterprise priority inbox systems should be implemented.” So I was learning how to implement something sort of like what Google does (but not really) that isn’t actually a traditional machine learning method. Personally, that wasn’t a huge motivation to pick through the authors’ butt-load of R code and figure out what they were doing. I don’t mean to be critical of the authors. I’m impressed with what they did. I mean, think about it: they wrote a dense 33-page chapter on how to build a ranking program in R as an exercise. That’s dedication. I guess I just couldn’t get rid of that nagging feeling that I would never make use of the material in this chapter in my professional life. So my goal became to study their R code and focus on becoming a better programmer.

When all was said and done, their R code wasn’t too hard to follow. I thought they did a great job of breaking everything down into manageable functions. In fact I would say what they do in this chapter is provide a good model for building a long, complicated R program. Having said that, the portions that jumped out at me as being both unfamiliar and potentially useful were the sections that used regular expressions.

I took a 4-week Coursera class that touched briefly on regular expressions. I also found this tutorial that I thought was useful. Other than that I really don’t know much about them. My impression of regular expressions is that unless you use them on a regular basis you’re not going to master them. So my strategy is to just know what they’re used for and remember that. When the time comes, I’ll have to dig in and figure out how to write them, because there is no way I’m going to teach myself how to use them and have total recall several months (or years) from now.

The first place they use regular expressions is to go through an email message and pull out the lines that contain “From:” so they can eventually strip out an email address:

msg.vec[grepl("From: ", msg.vec)]

The grepl function takes a regular expression as its first argument and returns a logical vector of TRUE/FALSE the same length as its second argument, the vector it combs looking for matches. So here msg.vec is the email message, with one element per line. grepl goes through each line and looks for the text “From: ”. If it finds it, it returns TRUE, otherwise FALSE. This is no doubt the easiest example of a regular expression because it’s a literal string. But that first argument could be a really sophisticated regular expression. Finally, the logical vector that grepl returns is used to index msg.vec so that only lines with “From: ” are returned.
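To see the two steps separately (the match, then the indexing), here is a small sketch. The vector msg.demo and its contents are made up for illustration and are not taken from the book's corpus.

```r
# Hypothetical stand-in for msg.vec: one element per line of an email
msg.demo <- c("Subject: Lunch on Friday?",
              "From: Test Guy",
              "Date: 20 Dec 2011")

# Step 1: grepl returns one TRUE/FALSE per element of msg.demo
is.from <- grepl("From: ", msg.demo)
is.from
# [1] FALSE  TRUE FALSE

# Step 2: indexing with that logical vector keeps only the matching lines
from.lines <- msg.demo[is.from]
from.lines
# [1] "From: Test Guy"

# fixed = TRUE treats the pattern as a plain string rather than a regex,
# which gives the same answer here because the pattern has no special characters
is.from.fixed <- grepl("From: ", msg.demo, fixed = TRUE)
```

Since the pattern is just literal text, the regex and fixed-string versions agree. They would differ only if the pattern contained characters such as "." or "[" that mean something special in a regular expression.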

Now the lines with “From: ” contain extraneous characters such as angle brackets and of course colons. To address this they use the strsplit function, which splits a character string into pieces wherever a given regular expression matches and returns the pieces in a list. Here’s how they use it:

strsplit(from, '[":<> ]')

where “from” is the character element. Here’s an example of how it works:

test <- "From: Test Guy "
strsplit(test, '[":<> ]')
 [1] "From" "" "Test" "Guy" ""
 [6] ""

You can see it splits the character element at each of the characters in the regular expression. (The square brackets define a character class: match exactly one character, and that character can be any of the ones listed inside the brackets.) If I wanted to extract the email address, I could do the following:

grepl("@", strsplit(test, '[":<> ]')[[1]])

That shows me which element contains the email address. So let’s save our strsplit result and use grepl to pull out the email address:

test2 <- strsplit(test, '[":<> ]')
test2[[1]][grepl("@", test2[[1]])]
 [1] ""
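Here is the same split-then-filter idea as a self-contained sketch. The header line and the address test.guy@example.com are placeholders I made up for illustration, and the names from.demo, parts and address are hypothetical too.

```r
# Hypothetical header line; the address is a placeholder, not from the corpus
from.demo <- "From: Test Guy <test.guy@example.com>"

# Split on any single character in the class: double quote, colon,
# angle brackets, or space. strsplit returns a list, so take element 1.
parts <- strsplit(from.demo, '[":<> ]')[[1]]
# parts has six elements:
#   "From", "", "Test", "Guy", "", "test.guy@example.com"
# The empty strings come from two delimiters sitting side by side
# (": " after From, and " <" before the address).

# Keep only the piece that contains an @ sign
address <- parts[grepl("@", parts)]
address
# [1] "test.guy@example.com"
```

This only works when exactly one piece contains an @. A header with two addresses would return both, so real code would need to decide which one it wants.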

Another function they use is the gsub function which basically does a search and replace based on a regular expression. Here’s an example:

test <- "   20 Dec 2011      "
test
 [1] "   20 Dec 2011      "
gsub("^\\s+|\\s+$", "", test)
 [1] "20 Dec 2011"

Notice the spaces before and after the date. The gsub function searches for leading or trailing whitespace and replaces it with nothing. And behold that regular expression. Pretty, isn't it? The caret ^ anchors the match to the beginning of the string, and "\s" matches any whitespace character. The extra backslash "\" in front of it escapes the backslash in front of the "s"! (This isn't unique to R: any language that treats the backslash as an escape character inside string literals needs the same doubling.) The + sign means match the previous item 1 or more times. The pipe "|" means "or". So search for leading whitespace OR trailing whitespace. After the pipe is the pattern for trailing whitespace. It's just like the pattern for leading whitespace, except instead of a caret at the front, it has a dollar sign at the end, which anchors the match to the end of the target string.
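To convince myself of what each half of that pattern does, here is a sketch that runs the two halves on their own and then together. The variable names are mine, and the comparison with trimws (a base R function, available since R 3.2.0) is my addition rather than something from the chapter.

```r
# Hypothetical padded string: 3 leading spaces, 6 trailing spaces
date.demo <- "   20 Dec 2011      "

# Left half of the pattern: whitespace anchored to the start of the string
no.lead <- gsub("^\\s+", "", date.demo)

# Right half of the pattern: whitespace anchored to the end of the string
no.trail <- gsub("\\s+$", "", date.demo)

# Both halves joined with | ("or") strip both ends in one call
both <- gsub("^\\s+|\\s+$", "", date.demo)
both
# [1] "20 Dec 2011"

# The doubled backslash is string-literal escaping: the string that reaches
# the regex engine is just two characters, a backslash and an s
nchar("\\s")
# [1] 2

# trimws does the same job for ordinary spaces, tabs and newlines
same.as.trimws <- identical(both, trimws(date.demo))
```

The spaces between "20", "Dec" and "2011" survive because neither half of the pattern can match in the middle of the string: each one is pinned to an end by its anchor.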

Again, there's a massive amount of R programming going on in this chapter and I barely even began to scratch the outside of the wrapping around the surface. Even though I question how practical the actual program is, I do think it's an excellent example of how to organize and tackle a complicated and long project in R.
