Machine Learning for Hackers, Chapter 11

Are you a heavy Twitter user? Would you like to build a custom recommendation system in R that will recommend new Twitter feeds for you to follow? If you answered “yes” to both questions then this is the chapter for you. Here’s the code. Have fun.

Now I don’t use Twitter, so this chapter was not very interesting for me. I should probably use Twitter and get with it and join the 2010s. But at the moment I don’t, so building a Twitter recommendation system in R doesn’t do me any good. I guess I could have followed along and built the recommendation system in the book for one of the authors, but I just couldn’t get motivated. It’s a lot of code that requires new packages. Plus there’s a section that requires using a program outside of R called Gephi. It’s an extremely specific case study. It’s not like some of the other chapters where they introduce a general machine learning method, like kNN or PCA, and show an example of how to use it. There is nothing general about this chapter. At one point they do make use of hierarchical clustering to find similarities between Twitter users. Clustering is a common machine learning tactic, but they only give it a couple of pages. I hate to sound down on the chapter. It’s well done and the authors put a lot of work into it. Personally, though, I just couldn’t get excited about it.

Having said that, I did see some interesting things in the R code that I wanted to file away and remember. The first was creating an empty vector. In a loop they run, they sometimes get a null result and want to store an empty vector for that pass. They do this as follows:

y <- c(integer(0), integer(0))

What caught my eye was the way rbind handles those vectors. It ignores them. I could see that being potentially useful. Here's a toy demonstration:

x <- c(5, 4)
y <- c(integer(0), integer(0))
z <- c(3, 2)
rbind(x,y,z)
   [,1] [,2]
x     5    4
z     3    2
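To see why that could be useful, here is a minimal sketch of the loop pattern described above. The function get_pair is hypothetical (it is not from the book); it just stands in for any step that sometimes has nothing to return. Note that c(integer(0), integer(0)) is the same thing as integer(0), a vector of length zero.

```r
# Hypothetical helper (not from the book): returns a pair of integers,
# or an empty vector when there is nothing to report for that input.
get_pair <- function(i) {
  if (i %% 2 == 0) {
    c(integer(0), integer(0))  # empty result for this pass
  } else {
    c(i, i * 10L)
  }
}

# Collect one result per pass, then stack them into a matrix.
rows <- lapply(1:5, get_pair)
result <- do.call(rbind, rows)

# The empty results for i = 2 and i = 4 contribute no rows.
result
nrow(result)
```

Because rbind drops the zero-length pieces, there is no need to filter out the empty passes before combining.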

Another snippet I wanted to note was how to identify unique values in a matrix. You use the unique function, which I was familiar with. What I didn't know was that you can get the unique values across a whole matrix by concatenating its columns into a single vector and calling unique on that, like this:

mat <- matrix(floor(runif(10,2,8)),5,2)
mat
     [,1] [,2]
[1,]    4    3
[2,]    2    7
[3,]    7    2
[4,]    3    7
[5,]    5    7
unique(c(mat[,1],mat[,2]))
[1] 4 2 7 3 5
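One thing worth keeping in mind is that calling unique directly on a matrix compares whole rows, not individual values. The sketch below hard-codes the matrix printed above (instead of drawing new random numbers) and adds a small made-up matrix m2 to show the row-wise behavior.

```r
# Same values as the matrix printed above, filled column by column.
mat <- matrix(c(4, 2, 7, 3, 5, 3, 7, 2, 7, 7), nrow = 5, ncol = 2)

# Unique values across the whole matrix, in order of first appearance
# (down column 1, then down column 2).
vals_cols <- unique(c(mat[, 1], mat[, 2]))
vals_vec  <- unique(as.vector(mat))  # same idea for any number of columns
identical(vals_cols, vals_vec)

# unique() on the matrix itself removes duplicated rows instead.
m2 <- rbind(c(1, 2), c(1, 2), c(3, 4))  # made-up example with a repeated row
unique(m2)
```

The as.vector version saves typing when the matrix has more than two columns.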

This chapter also used a base function that I had never seen before called duplicated. It returns a vector of TRUE/FALSE values that tells you whether each value has already occurred earlier in the vector. I was surprised I hadn't heard of it before. Here's a demo of duplicated:

x <- c(1:5,2:6)
x
[1] 1 2 3 4 5 2 3 4 5 6
duplicated(x)
[1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
!duplicated(x)
[1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE
x[duplicated(x)]
[1] 2 3 4 5
x[!duplicated(x)]
[1] 1 2 3 4 5 6
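A few related behaviors seemed worth jotting down next to that demo. This is just a sketch: the data frame df_dup is made up for illustration and does not come from the book.

```r
x <- c(1:5, 2:6)

# Keeping the non-duplicated elements gives the same result as unique().
identical(x[!duplicated(x)], unique(x))

# fromLast = TRUE scans from the end, so the earlier copies get flagged
# and the last occurrence of each value is the one that is kept.
duplicated(x, fromLast = TRUE)

# On a data frame, duplicated() compares whole rows.
df_dup <- data.frame(id = c(1, 2, 2, 3), grp = c("a", "b", "b", "a"))
deduped <- df_dup[!duplicated(df_dup), ]
deduped
```

The logical vector is handy when you need to know which positions are repeats, not just what the distinct values are.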

Finally, they make use of sapply in one of their functions and I wanted to document its use. The suite of apply functions always gives me trouble for some reason. I get the gist of what they do; I just have a hard time remembering why I would want to use one over the other. Anyway, what they do is use sapply to run a function over each row of a matrix. That seems like a useful thing to remember. First let's create a 100 x 3 data frame of values:

a <- rnorm(100,2,4)
b <- rnorm(100,3,4)
c <- rnorm(100,5,4)
df <- data.frame(a,b,c)

Now, let's say I wanted to create a TRUE/FALSE vector indicating which rows contain a value less than 0. Here's how I can do that using sapply:

sapply(seq_len(nrow(df)), function(i) any(df[i,] < 0))
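Since I always forget which apply function to reach for, here is a sketch of the same row-wise check written three ways, each returning TRUE/FALSE directly. The seed and the names by_index, by_apply and by_rowsums are my own additions for illustration, not the authors' code.

```r
set.seed(1)  # arbitrary seed so the example is repeatable
df <- data.frame(a = rnorm(100, 2, 4),
                 b = rnorm(100, 3, 4),
                 c = rnorm(100, 5, 4))

# 1. sapply over the row indices, as in the example above.
by_index <- sapply(seq_len(nrow(df)), function(i) any(df[i, ] < 0))

# 2. apply over rows (MARGIN = 1); the data frame is converted to a matrix.
by_apply <- apply(df, 1, function(row) any(row < 0))

# 3. Fully vectorized: count the negatives in each row.
by_rowsums <- rowSums(df < 0) > 0

# All three flag the same rows.
all(by_index == by_apply)
all(by_index == by_rowsums)
```

For an all-numeric data frame like this one the rowSums version is the shortest; sapply over indices is more flexible when each step needs to look things up in other objects.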

The authors did something similar to that and then used the vector of TRUE/FALSE values to subset another vector. Of course, if I wanted to subset my example data frame above to show only rows that contain a value less than 0, then I could do this:

subset(df, a < 0 | b < 0 | c < 0)
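To tie the two ideas together, here is a rough sketch of using a TRUE/FALSE vector as an index, both on a separate vector and on the data frame itself. The vector other is a hypothetical stand-in for whatever companion vector you want to filter; it is not from the book.

```r
set.seed(1)  # arbitrary seed so the example is repeatable
df <- data.frame(a = rnorm(100, 2, 4),
                 b = rnorm(100, 3, 4),
                 c = rnorm(100, 5, 4))

# TRUE for rows containing at least one negative value.
has_neg <- rowSums(df < 0) > 0

# Hypothetical companion vector with one element per row, e.g. row IDs.
other <- seq_len(nrow(df))
flagged_ids <- other[has_neg]

# Indexing the data frame with the logical vector picks the same rows
# that subset() does.
by_index  <- df[has_neg, ]
by_subset <- subset(df, a < 0 | b < 0 | c < 0)
all.equal(by_index, by_subset)
```

The logical-index form does not need the column names spelled out, so it keeps working if columns are added later.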

So that's the helpful R code I gleaned from Chapter 11. I feel like there should be more. It's a long chapter with a lot of code. But much of it either involves functions from specific packages or uses common R functions I've already documented or know about. Next up is chapter 12, the final chapter.

Clay Ford
