Category Archives: Using R

Machine Learning for Hackers, Chapter 8

This is a tidy chapter on principal component analysis (PCA). There are many fine tutorials on PCA to be found on the internet. This one is special because it doesn’t start with a nice data set; the authors first have to beat one into shape. Their goal is to develop a market index based on 25 stocks. They run PCA and use the first component to build the index.

After they read in a CSV file with the data, they have to translate a column of dates into Date objects. They use a package called “lubridate” for this, but it seems the base as.Date function works just as well. This is what I did:

prices <- read.csv('data/stock_prices.csv')
prices <- transform(prices, Date = as.Date(Date))

 

Next they have to reshape the data. The prices dataset looks like this:

 head(prices)
        Date Stock Close
1 2011-05-25   DTE 51.12
2 2011-05-24   DTE 51.51
3 2011-05-23   DTE 51.47
4 2011-05-20   DTE 51.90
5 2011-05-19   DTE 51.91
6 2011-05-18   DTE 51.68

 

They want to reshape this so there is one row per date, with Date as the first column and one column per stock. They use the cast function from the reshape package.

library(reshape)
date.stock.matrix <- cast(prices, Date ~ Stock, value = 'Close')

 

The result:

        Date   ADC   AFL ARKR AZPN  CLFD   DDR   DTE  ENDP FLWS     FR GMXR
1 2002-01-02 17.70 23.78 8.15 17.10 3.19 18.80 42.37 11.54 15.77 31.16 4.50
2 2002-01-03 16.14 23.52 8.15 17.41 3.27 18.55 42.14 11.48 17.40 31.45 4.37
3 2002-01-04 15.45 23.92 7.79 17.90 3.28 18.46 41.79 11.60 17.11 31.46 4.45
4 2002-01-07 16.59 23.12 7.79 17.49 3.50 18.30 41.48 11.90 17.38 31.10 4.38
5 2002-01-08 16.76 25.54 7.35 17.89 4.24 18.57 40.69 12.41 14.62 31.40 4.30
6 2002-01-09 16.78 26.30 7.40 18.25 4.25 18.41 40.81 12.27 14.27 31.05 4.25

 

Now after they do this they mention the new dataset has missing entries (NAs) that need to be removed, but they don't tell you how they found them. Furthermore, they don't just remove a row of NAs; they remove an entire stock's column because it had a value on a day when all the other stocks were NA! Here's what they do:

# remove a row and a column
prices <- subset(prices, Date != "2002-02-01")
prices <- subset(prices, Stock != "DDR")
date.stock.matrix <- cast(prices, Date ~ Stock, value = 'Close')

 

That reduces the data set to 2366 rows and 25 columns. Instead of 25 stocks, we now have 24. (Remember, the first column is Date.) What confused me is why they didn't do something like this:

date.stock.matrix <- na.omit(date.stock.matrix)

 

That reduces the data set by two rows to 2365 rows but keeps all the columns, so you have 25 stocks to develop your index with instead of 24. But let's back up for a moment and think about how they found those missing values in the first place. The book just says they noticed missing entries and needed to remove them. Here's one way to check for missing values if you don't already know whether there are any. First get the dimensions of your data frame:

dim(date.stock.matrix)
[1] 2367 26

 

Now run the following to see if the dimensions change:

dim(date.stock.matrix[complete.cases(date.stock.matrix),])
[1] 2365 26

 

They do! Two rows have missing values. But which two rows? Try this:

row.has.na <- apply(date.stock.matrix, 1, function(x){any(is.na(x))})
date.stock.matrix[row.has.na,]

 

That returns the two rows in the data frame with missing values, in this case rows 22 and 875. Again, we can use the na.omit code above to clear out those rows. I'm not sure why the authors didn't do this. When they run their PCA on their final data set with 24 stocks, the first component accounts for 29% of the variance. But when I run it on the data set with all 25 stocks and only two rows removed, the first component accounts for almost 33% of the variance. Not a huge improvement, but still improvement.
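If you want to check those numbers yourself, here's a minimal sketch. I believe the chapter uses princomp, and summary() reports the proportion of variance for each component:

# assuming the two incomplete rows have been dropped with na.omit as above
pca <- princomp(date.stock.matrix[, 2:ncol(date.stock.matrix)])
summary(pca) # the "Proportion of Variance" line shows what Comp.1 accounts for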

A couple of other "that was cool" nuggets I took from this chapter were the way they looked at the correlation matrix (to get a feel for whether PCA was warranted) and their use of the melt function from the reshape package. First the correlation trick:

cor.matrix <- cor(date.stock.matrix[,2:ncol(date.stock.matrix)])
corrs <- as.numeric(cor.matrix)
library(ggplot2)
ggplot(data.frame(Correlation = corrs), aes(x = Correlation, fill = 1)) +
  geom_density() + opts(legend.position = 'none')

 

They calculate the correlation matrix of the closing values, convert the correlation matrix to a vector, and then make a density plot. This visually shows you that most of the correlations are positive and that PCA doesn't seem out of the question. I thought that was a neat trick.

Now the melt function. They use it to reshape a data set containing both their market index and the DJI so that both can be plotted over time. Let's look at the before and after to see what it does.

Before melt:

 head(comparison)
       Dates MarketIndex       DJI
1 2002-01-02 -1.0346886 -0.3251888
2 2002-01-03 -1.0265119 -0.2602872
3 2002-01-04 -1.0199864 -0.2027079
4 2002-01-07 -1.0230787 -0.2439139
5 2002-01-08 -0.9953525 -0.2744783
6 2002-01-09 -0.9873846 -0.3115893

 

After melt:

alt.comparison <- melt(comparison, id.vars='Dates')
names(alt.comparison) <- c('Dates', 'Index', 'Price')
 head(alt.comparison)
       Dates       Index      Price
1 2002-01-02 MarketIndex -1.0346886
2 2002-01-03 MarketIndex -1.0265119
3 2002-01-04 MarketIndex -1.0199864
4 2002-01-07 MarketIndex -1.0230787
5 2002-01-08 MarketIndex -0.9953525
6 2002-01-09 MarketIndex -0.9873846

 

So what they want to do is plot both indices over time on the same graph. In the first data set, each row is a date with both index values for that date. The melt call says to treat Dates as the identifier and stack the remaining columns. Each column name from the first data set becomes a level of a grouping factor in the second, and the values that were in columns 2 and 3 of the first data set all end up in column three of the second. It's hard to explain in words but makes sense if you stare at the before-and-after long enough.
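With the data melted into that long format, both series can go on one plot. Roughly along the lines of the book's ggplot code (the exact styling may differ):

library(ggplot2)
ggplot(alt.comparison, aes(x = Dates, y = Price, group = Index, colour = Index)) +
  geom_point() + geom_line()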

Machine Learning for Hackers, Chapter 7

The title of this chapter, “Optimization: Breaking Codes”, sounds more exciting than it really is. The code breaking is attempted via an optimization algorithm called the Metropolis method. The authors show you how to encrypt text using a simple Caesar cipher and then how to implement the Metropolis method to attempt to decipher the message. Their test message is “here is some sample text”. They encrypt it using the Caesar cipher, so it looks like “ifsf jt tpnf tbnqmf ufyu”. The encryption rule simply replaces every letter with the following letter in the alphabet: a is replaced with b, b with c, and so on, until z is replaced with a. They manage to decipher their test text over the course of 50,000 Metropolis steps (on the 45,609th attempt). I suppose this is a fun demonstration of optimization. It just seemed a little outdated. I guess I was expecting to see them crack an actual message from World War II or something.

They start the chapter by demonstrating how to use the R optim function. For some reason this function has always confused me. Probably because I don’t use it that often. But I have to say I feel much better about it now. They give clear demonstrations of its use and visually show you how it works. Their example involves the height and weight dataset from chapter 2. First we do linear regression and find coefficients for the slope and intercept. Then we find the same coefficients using optim. The function minimized in the call to optim is a hand-built sum of squared errors function. I thought this was a great way to demonstrate how optim works. But after this, they do something puzzling: they continue to demonstrate how to use optim with Ridge Regression. Now Ridge Regression is a type of regression that helps remedy multicollinearity. If you have predictor variables that are highly correlated, you may get some weird coefficients in your linear modelling results. I’m talking about getting a negative value where you would expect a positive value, that sort of thing. Ridge regression helps address this by “modifying the method of least squares to allow biased estimators of the regression coefficients” (Kutner, et al., p. 432). Knowing that, it doesn’t make much sense to demonstrate Ridge regression on height and weight data. There’s only one predictor. I thought maybe they were going to do more with Ridge regression in this chapter, but they don’t. They move on to code breaking. It just seemed out of place to bring up Ridge regression in the context of single predictor regression.
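Getting back to the optim demonstration, the gist is minimizing a hand-built sum of squared errors function. Here's a minimal sketch, assuming a data frame called heights.weights with columns Height and Weight as in chapter 2 (those names are my assumption):

# hand-built sum of squared errors for a line with intercept a and slope b
squared.error <- function(par, x, y)
{
  a <- par[1]
  b <- par[2]
  sum((y - (a + b * x))^2)
}

# minimize it; the result should be close to coef(lm(Weight ~ Height, data = heights.weights))
optim(c(0, 0), squared.error,
      x = heights.weights$Height,
      y = heights.weights$Weight)$par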

I won’t go into the code breaking program. The code is available on the book’s git hub page. What I wanted to review for my own benefit is their function for encrypting text using the Caesar cipher. I thought it was a good example of using R’s indexing capabilities with list objects. First we need a vector with the English alphabet and two empty list objects for the ciphers: one for encrypting text and another for deciphering text:

english.letters <- letters
caesar.cipher <- list()
inverse.caesar.cipher <- list()

 

Now we create the ciphers:

# need to use index %% 26 + 1 to handle letter "z" at end
# 26 %% 26 + 1 = 0 + 1 = 1
for (index in 1:length(english.letters))
{
 caesar.cipher[[english.letters[index]]] <- english.letters[index %% 26 + 1]
 inverse.caesar.cipher[[english.letters[index %% 26 + 1]]] <- english.letters[index]
}

 

The "index %% 26 + 1" confused me for a couple of reasons. First off, the %% operator means modulus and I'm used to see something like 12 %% 5, where the first number is bigger than the second. The answer to that is of course 2. If you divide 12 by 5 you get remainder 2, and that's what the %% operator returns. But in the code above, you'll notice the "index" value is less than 26 on every iteration except the last. How do you calculate something like 2 %% 26? I wasn't sure so I had to google around and figure it out. It turns out the modulus operation a %% n is defined as follows:

a \, mod \, n = a - \lfloor \frac{a}{n} \rfloor \times n

So, 2 %% 26 is calculated like so:

2 \, mod \, 26 = 2 - \lfloor \frac{2}{26} \rfloor \times 26 = 2 - \lfloor 0.0769 \rfloor \times 26 = 2 - (0 \times 26) = 2

Once I was clear on that, I was confused about why we needed the %% operator at all. Looking at the code above, you'll see "index %% 26 + 1". When index = 2, "index %% 26" gives 2. When index = 3, it gives 3. And so on. So why not just use "index + 1"? Because when we get to the letter "z" we need to substitute "a", whose index number is 1. If we used "index + 1", we would get 27 at the letter "z" (its index number is 26), which doesn't correspond to anything in the 26-element "english.letters" vector (you get NA instead of "a") and the cipher breaks. The %% operator lets us cycle through the numbers 1 - 26: when index = 26 (i.e., when we get to the letter "z"), 26 %% 26 + 1 equals 1, since 26/26 has remainder 0.
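A quick way to see the cycling is to compute the shifted index for all 26 letters at once:

# gives 2, 3, ..., 26 and then wraps back around to 1
(1:26) %% 26 + 1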

Now that I understand how the ciphers are generated, I can print them and have a look if I like:

print(caesar.cipher) # encrypt text
print(inverse.caesar.cipher) # decipher text

 

Since these are lists, we look up entries using double brackets. For example, to see what we substitute for the letter "h", we do the following:

caesar.cipher[["h"]]

 

That returns "i". It seems like a fun extension to this would be to create a general function that allowed you to specify the substitution index. In the Caesar cipher, this is 1. It would be neat to specify, say, 13 so "a" becomes "n", "b" becomes "o", and so on. Anyway, moving on...

Now we use the ciphers we just made to encrypt text. To do this the authors build two functions. The first encrypts text letter-by-letter using the cipher you give it. The second function encrypts words using the first function. Let's have a look at these:

apply.cipher.to.string <- function(string, cipher)
{
  output <- ''
  for (i in 1:nchar(string))
  {
    output <- paste(output, cipher[[substr(string, i, i)]], sep = '')
  }
  return(output)
}

apply.cipher.to.text <- function(text, cipher)
{
  output <- c()
  for (string in text)
  {
    output <- c(output, apply.cipher.to.string(string, cipher))
  }
  return(output)
}

 

The first function loops through each letter of a word (or "string"). Using the substr function, we extract each letter, use that letter to look up the substitution in the cipher, and "paste" the substituted letter to the previously substituted letter. At the conclusion of the loop we have encrypted the word using our cipher. The second function uses this first function to encrypt whole phrases. Here is how the authors encrypt the message "here is some sample text" using the Caesar cipher:

apply.cipher.to.text(c('here', 'is', 'some', 'sample','text'), caesar.cipher)

 

That gives us "ifsf" "jt" "tpnf" "tbnqmf" "ufyu" . If we wanted to decipher the text, we would use the inverse cipher as follows:

apply.cipher.to.text(c('ifsf', 'jt', 'tpnf', 'tbnqmf', 'ufyu'), inverse.caesar.cipher)

 

Again it would be a good exercise to generalize these functions to accept a whole text file and return an encrypted text file. I won't do it properly here, but it sounds like it would make a good project assignment in some programming class.
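Still, a bare-bones sketch of the idea might look something like this (my own code, not the book's: it lowercases everything, drops characters the cipher doesn't know about, and the file names are just placeholders):

apply.cipher.to.file <- function(infile, outfile, cipher)
{
  lines <- readLines(infile)
  out <- sapply(lines, function(line)
  {
    # split into words and keep only the lowercase letters the cipher can handle
    words <- unlist(strsplit(tolower(line), ' '))
    words <- gsub('[^a-z]', '', words)
    words <- words[nchar(words) > 0]
    paste(apply.cipher.to.text(words, cipher), collapse = ' ')
  })
  writeLines(out, outfile)
}

# apply.cipher.to.file('message.txt', 'message-encrypted.txt', caesar.cipher)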

Machine Learning for Hackers, Chapter 6

Chapter 6 covers orthogonal polynomial regression, cross-validation, regularization and a touch of logistic regression. The case study is predicting popularity of O’Reilly’s top 100 books based on their back cover descriptions. As usual I picked up some useful tricks from this chapter. Let’s get into it.

If you have a data frame called “df” with a variable X and you want to add, say, X^{2} to it (and name it “X2”), you can do the following:

df <- transform(df, X2 = X ^ 2)

Now it turns out R has a function called poly that does this for you:

poly(X, degree = 2, raw = TRUE)

So you could leave your data frame as is (say it has two columns, Y and X) and do polynomial regression like so:

lm(Y ~ poly(X, degree=2, raw=TRUE), data = df)

That's straight-up raw polynomial regression. If you leave out the "raw = TRUE" parameter (or set it to FALSE, its default value), the poly function generates orthogonal polynomials. As the book states: "instead of naively adding simple powers of x, we add more complicated variants of x that work like successive powers of x, but aren't correlated with each other in the way that x and x^{2} are." Two vectors are orthogonal when their dot product equals 0. You can verify that poly produces orthogonal vectors. Here's a simple example:

x <- 1:10
p <- poly(x, 2)
p[,1] %*% p[,2]

That returns 0 (up to floating-point rounding). I don't understand exactly how R creates these orthogonal vectors. The object poly returns carries a "coefs" attribute with components "alpha" and "norm2", which I'm pretty sure have something to do with it, but the help pages and Google searches yielded nothing definitive. (Though I will say this guy's blog post was useful.) Anyway, they illustrate polynomial regression using some toy data:

set.seed(1)
x <- seq(0, 1, by = 0.01)
y <- sin(2 * pi * x) + rnorm(length(x), 0, 0.1)
df <- data.frame(X = x, Y = y)
# raw polynomial regression
m1 <- lm(Y ~ poly(X,degree=5,raw=TRUE), data=df)
# orthogonal polynomial regression
m2 <- lm(Y ~ poly(X,degree=5), data=df)

 

If you want to make predictions using your models, do something like this:

d <- seq(0,1,by = 0.05)
p1 <- predict(m1, data.frame(X=d))
p2 <- predict(m2, data.frame(X=d))

 

If you want to see how your model visually performs, do something like this:

plot(x, y, xlab = "X", ylab = "Y", las = 1, xlim = c(0, 1))
lines(d, p1, col = "red")
lines(d, p2, col = "blue")

 

Finally, if you want to refit the model with some of the polynomial terms dropped, you can do something like this:

m1.2 <- lm(Y ~ poly(X,degree=5,raw=TRUE)[,-4], data=df)
m1.3 <- lm(Y ~ poly(X,degree=5,raw=TRUE)[,c(-2,-4)], data=df)

 

From here on they mostly stick with orthogonal polynomial regression and use cross-validation. They present a general way to randomly split data into training and test sets and then loop through polynomial degrees, evaluating RMSE at each step for both sets. See their 14th and 15th code snippets for that code. I think it's quite useful.
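Paraphrasing the idea (this is not their exact code), using the toy sine data from above:

set.seed(1)
n <- nrow(df)
indices <- sample(1:n, round(0.5 * n))
training.df <- df[indices, ]
test.df <- df[-indices, ]

rmse <- function(y, y.hat) sqrt(mean((y - y.hat)^2))

performance <- data.frame()
for (d in 1:10)
{
  fit <- lm(Y ~ poly(X, degree = d), data = training.df)
  performance <- rbind(performance,
                       data.frame(Degree = d,
                                  TrainRMSE = rmse(training.df$Y, predict(fit)),
                                  TestRMSE = rmse(test.df$Y, predict(fit, newdata = test.df))))
}
performance

The training RMSE can only go down as the degree increases; the interesting part is where the test RMSE bottoms out and starts climbing again, which is the overfitting story the chapter is telling.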

After this they go into Regularization, which is some heavy-duty black box stuff. I guess they do a good job of showing you how to do it using the "glmnet" package. Still, though, I have no idea what's going on under the hood. I would feel somewhat disingenuous using this for real. I have no doubt it works. I'm not questioning their application or the glmnet package. I just feel uneasy using methods that I don't understand. I'm always afraid someone will ask me to explain how it works.

Finally they use Regularization and logistic regression to build a model for predicting whether or not a book will be in the top 50 (of the 100 best selling books) given its back-cover description. Instead of summarizing that analysis I wanted to write a little about how they hammer the data into shape to analyze it. First they read in the data, then they load the "tm" package and start calling functions:

ranks <- read.csv('data/oreilly.csv', stringsAsFactors = FALSE)
library('tm')
documents <- data.frame(Text = ranks$Long.Desc.)
row.names(documents) <- 1:nrow(documents) # not sure why this is necessary
corpus <- Corpus(DataframeSource(documents))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords('English'))

 

The last four lines are where the magic happens. First they make a "corpus", which is sort of like a list whose elements are text documents. The next line makes everything lower case. The next collapses multiple whitespace characters to a single blank. The last removes the most common words in English, like "the" and "an". When you're done, you can't view the corpus like you would a data frame. You can't just type the name at the command line and hit enter. You have to use the inspect() function, like this:

inspect(corpus) # all documents
corpus[[2]] # second document

 

Pretty cool. But it's still not ready for regression analysis. You have to produce a Document Term Matrix (DTM) from your corpus. This is done like so:

dtm <- DocumentTermMatrix(corpus)

 

Easy enough. But, again, how to view it? Use the inspect() function, like this:

# first three documents and first five terms
inspect(dtm[1:3,1:5])
# all documents and first five terms
inspect(dtm[,1:5])

 

The rows indicate the document (in this case numbered 1 - 100) and the column headers are the words (or terms). The values in the matrix tell you how many times the term appeared in a document. Here's a quick way to do some analysis on those terms:

# find those terms that occur at least five times
findFreqTerms(dtm, 5)

 

The last step to get everything ready for analysis is to convert the DTM into a matrix:

x <- as.matrix(dtm)

 

We do this because the glmnet function requires the predictors to be in a matrix, and the DTM represents the predictors. (The response is an indicator: top-50 book or not.) Now the dimension of the dtm matrix is 100 x 5605! That's a lot of columns! The authors have this to say: "Because our data set has more columns than rows, unregularized regression will always produce an overfit model. For that reason, we have to use some form of regularization to get any meaningful results." They probably could have eliminated some of the columns. When I inspected the DTM I saw terms such as "!)</li", "#</--", "#</,", "#</em", "#</em," and "&amp;". These are clearly just left over HTML codes. I can't see how they contribute to this analysis. But even if they were removed the matrix would still have many more columns than rows.
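For the record, the regularized logistic fit itself boils down to something like this rough sketch (not the book's code; the 0/1 response below is just a placeholder for the real top-50 indicator, which the book builds from the ranks data):

library(glmnet)

# placeholder response: 1 = in the top 50, 0 = not; in the book this comes
# from the ranks data, not a hard-coded vector like this
y <- rep(c(1, 0), each = 50)

# cross-validated regularized logistic regression on the document-term matrix
fit <- cv.glmnet(x, y, family = 'binomial')
coef(fit, s = 'lambda.min') # coefficients at the lambda chosen by cross-validation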

 

Machine Learning for Hackers, Chapter 5

This chapter covers linear regression. The case study is predicting page views for web sites given unique visitors, whether the site advertises, and whether the site is in English. The funny thing is that while they spend a lot of time explaining how to do linear regression in R, they never get around to explaining how to make a prediction using your regression model. (The chapter is actually titled “Regression: Predicting Page Views”.) I’ll explain how to do prediction in this post. But first I want to say the authors do a fine job of explaining the “big idea” of linear regression. It’s not an easy topic to cover in under 30 pages. If you’re new to the subject this is a great introduction. I highly recommend it. (And we all know my recommendation means a great deal.)

So let’s say you follow along with the book and create a linear model to predict page views given unique visitors:

lm.fit <- lm(log(PageViews) ~ log(UniqueVisitors), data = top.1000.sites)

 

Notice the log transformation. This is necessary since the data is not normally distributed. The results of the regression are as follows:

Coefficients:
                   Estimate Std. Error t value   Pr(>|t|) 
(Intercept)        -2.83441    0.75201  -3.769   0.000173 ***
log(UniqueVisitors) 1.33628    0.04568  29.251    < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.084 on 998 degrees of freedom
Multiple R-squared: 0.4616, Adjusted R-squared: 0.4611 
F-statistic: 855.6 on 1 and 998 DF, p-value: < 2.2e-16

 

If we want to predict Page Views for a site with, say, 500,000,000 Unique Visitors, we use the coefficient estimates to calculate the following (which gives the prediction on the log scale):

-2.83441 + 1.33628*log(500,000,000)

To easily do this in R we can submit code like this:

new <- data.frame(UniqueVisitors = 500000000)
exp(predict(lm.fit, new, interval = "prediction"))

 

This gives us not only a prediction (fit) but a prediction interval as well (lwr and upr):

          fit        lwr         upr
1 24732846396 2876053856 2.12692e+11

 

Also notice we had to wrap the predict function in the exp function. That's because we ran the regression with the data transformed in the log scale. Once we make our prediction we need to transform it back to the original scale. I'm surprised the authors don't cover this. They do mention the predict function as a way to see what the regression model predicts for the original data values. But they don't talk about how to use your regression results to make a prediction for a new value that wasn't in your data. Perhaps that will be covered in future chapters.

Another minor complaint I have is a statement they make on page 150 in the 2nd paragraph: "...the Residual standard error is simply the RMSE of the model that we could compute using sqrt(mean(residuals(lm.fit) ^ 2))", where lm.fit is the model object created when regressing Page Views on Unique Visitors. That's incorrect. The residual standard error for this model is computed with sqrt(sum(residuals(lm.fit)^2)/998), where 998 = 1000 - 2. In other words, instead of dividing the sum of the squared residuals by 1000 (i.e., taking the mean), we divide by the number of observations minus the number of coefficients in the model. In this case we have 1000 observations and two coefficients (the intercept and slope), so we divide by 998.
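You can check this against the summary output:

# residual standard error by hand: sqrt(SSE / (n - number of coefficients))
sqrt(sum(residuals(lm.fit)^2) / df.residual(lm.fit))

# what summary() reports as the residual standard error
summary(lm.fit)$sigma

Both should come out to the 1.084 shown above.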

While I didn't learn anything new about linear regression in this chapter, I did learn something about plotting lm model objects in R. I've known for a while you can easily create diagnostic plots by calling the plot function with a model object as the argument. This produces four diagnostic plots (shown one after another, or in a 2 x 2 grid if you first set par(mfrow = c(2, 2))). What I learned in this chapter is that you can specify which plot is produced by supplying an additional argument "which = ". For example, to just see a plot of residuals versus fitted values, you submit

plot(lm.fit, which = 1)

 

The book says issuing this statement will "plot the residuals against the truth." I think "truth" should say "fitted values". I'm not sure I've ever heard "fitted values" referred to as the "truth".

Anyway, as I said, I think this chapter is a solid introduction to linear regression. That's a good thing because it appears the next chapter steps it up a notch.

Machine Learning for Hackers, Chapter 4

This chapter explains how to create a ranking algorithm to prioritize email messages in your inbox. The authors mention Google popularizing the idea of creating a “priority inbox” and point to a paper that describes their strategies. They then proceed to use some of these strategies to create their own ranking program using emails from the SpamAssassin public corpus.

This was a long and unpleasant chapter for me to slog through. For one I don’t think I will ever be asked to design a ranking algorithm for emails. I suppose this concept could be extended to other applications but the chapter doesn’t go there. Another sticking point was that the algorithm the authors build is necessarily incomplete because we’re only dealing with emails received (instead of both emails sent and received). The authors acknowledge this up front and say “the methods and algorithms used in this chapter should be considered as exercises only and not examples of how enterprise priority inbox systems should be implemented.” So I was learning how to implement something sort of like what Google does (but not really) that isn’t actually a traditional machine learning method. Personally that wasn’t huge motivation to pick through the authors’ butt-load of R code and figure out what they were doing. I don’t mean to be critical of the authors. I’m impressed with what they did. I mean, think about it: they wrote a dense 33-page chapter on how to build a ranking program in R as an exercise. That’s dedication. I guess I just couldn’t get rid of that nagging feeling that I would never make use of the material in this chapter in my professional life. So my goal became to study their R code and focus on becoming a better programmer.

When all was said and done, their R code wasn’t too hard to follow. I thought they did a great job of breaking everything down into manageable functions. In fact I would say what they do in this chapter is provide a good model for building a long, complicated R program. Having said that, the portions that jumped out at me as being both unfamiliar and potentially useful were the sections that used regular expressions.

I took a 4-week Coursera class that touched briefly on regular expressions. I also found this tutorial that I thought was useful. Other than that I really don’t know much about them. My impression of regular expressions is that unless you use them on a regular basis you’re not going to master them. So my strategy is to just know what they’re used for and remember that. When the time comes, I’ll have to dig in and figure out how to write them, because there is no way I’m going to teach myself how to use them and have total recall several months (or years) from now.

The first place they use regular expressions is to go through an email message and pull out the lines that contain “From:” in order to eventually strip out an email address:

msg.vec[grepl("From: ", msg.vec)]

The grepl function takes a regular expression as its first argument and returns a logical vector of TRUE/FALSE the same length as its second argument, the vector it combs looking for matches. So here msg.vec is the email message. grepl goes through each line and looks for the text “From: “. If it finds it, it returns TRUE, otherwise FALSE. This is no doubt the easiest example of a regular expression because it’s a literal string. But that first argument could be a really sophisticated regular expression. Finally, grepl is used to index msg.vec such that only lines with “From: ” are returned.

Now the lines with “From: ” contain extraneous characters such as angle brackets and of course colons. To address this they use the strsplit function, which splits a character element into a list by a given regular expression. Here’s how they use it:

strsplit(from, '[":<> ]')

where “from” is the character element. Here’s an example of how it works:

test <- "From: Test Guy "
strsplit(test, '[":<> ]')
 [[1]]
 [1] "From" "" "Test" "Guy" ""
 [6] "test@guy.com"

You can see it splits the character element at every character that appears inside the square brackets. (The square brackets define a character class: match any single one of the characters listed inside, one character position at a time.) If I wanted to extract the email address, I could save the strsplit result and use grepl on it:

grepl("@", test2[[1]])
 [1] FALSE FALSE FALSE FALSE FALSE TRUE

That shows me which element contains the email address, so I can use that logical vector to index the split result and pull out the address:

test2[[1]][grepl("@", test2[[1]])]
 [1] "test@guy.com"

Another function they use is the gsub function which basically does a search and replace based on a regular expression. Here’s an example:

test <- "   20 Dec 2011      "
test
 [1] "   20 Dec 2011     "
gsub("^\\s+|\\s+$", "", test)
 [1] "20 Dec 2011"

Notice the spaces before and after the date. The gsub function searches for leading or trailing whitespace and replaces it with nothing. And behold that regular expression. Pretty, isn't it? The caret ^ anchors the match at the beginning of the string, and "\s" matches any whitespace character. The extra backslash in front of it escapes the backslash inside the R string literal (that's not unique to R; any language where you write regular expressions inside ordinary string literals needs it). The + sign means match the previous item one or more times. The pipe "|" means "or": search for leading whitespace OR trailing whitespace. After the pipe is the pattern for trailing whitespace. It's just like the pattern for leading whitespace, except instead of a caret at the front it has a dollar sign at the end, which anchors the match to the end of the string.

Again, there's a massive amount of R programming going on in this chapter and I barely even began to scratch the outside of the wrapping around the surface. Even though I question how practical the actual program is, I do think it's an excellent example of how to organize and tackle a complicated and long project in R.

Simulating the Monty Hall Problem

Back in 1990, the following question was posed to PARADE magazine columnist, Marilyn Vos Savant (once holder of the Guinness World Record for highest IQ):

Suppose you’re on a game show, and you’re given the choice of three doors. Behind one door is a car, behind the others, goats. You pick a door, say #1, and the host, who knows what’s behind the doors, opens another door, say #3, which has a goat. He says to you, “Do you want to pick door #2?” Is it to your advantage to switch your choice of doors?

Marilyn’s response: “Yes; you should switch. The first door has a 1/3 chance of winning, but the second door has a 2/3 chance.” And thus began a firestorm of controversy. Over a thousand people with PhDs (many of them statisticians and mathematicians) wrote in to her column to tell her she was wrong. Wikipedia has a nice summary of how it went down. And Marilyn’s own site has a nice sample of some of the thousands of letters she received. In the end, though, she was right.

If you want to know why she was right, visit the links above. You won’t believe how many ways there are to analyze this seemingly simple problem in probability. What I want to do in this post is simulate it using R. If what she said was right, I should win 2/3 of the time.

So here’s what we do. First define a variable to store the number of simulations and a vector to store the results  of each simulation.

# number of simulations
n <- 10000
# allocate memory to store results
x <- vector(mode = "integer", length = n)

Now we do the simulation. Picking the first door can be thought of as a single binomial trial with p = 1/3: we either pick the car (1) or we don't (0), and we have a 2/3 chance of not picking the car. After we pick our door, we know another door will be opened to reveal a goat; that happens every time. Since we always switch, we'll either switch to a goat or to a car. Hence the "if" statement: if we picked the car (1), we switch to a goat (0); otherwise we switch to the car (1).

# simulation
for (j in 1:n){
  # pick a door; either goat=0 (p=2/3) or car=1 (p=1/3)
  x[j] <- rbinom(1, 1, prob = (1/3))
  # switch; if 1, switch to 0, and vice versa
  if (x[j] == 1) {
    x[j] <- 0
  } else x[j] <- 1
}

When the simulation is done, calculate the percentage of times you switched to a car (1):

sum(x)/n # percentage of time you get car

When I ran this I got 0.6667, which is in excellent agreement with the theoretical answer of 2/3. Yay winning cars on game shows!
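As a footnote, since switching wins exactly when the first pick was a goat, the whole simulation collapses to a one-liner:

# proportion of first picks that were goats, i.e., wins for the switching strategy
mean(rbinom(n, 1, prob = 1/3) == 0)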

Machine Learning for Hackers, Chapter 3

In this chapter you build a naive Bayes classifier to classify an email as spam or not spam (or “ham” as they call it) using only the body of the message. The idea is to take a bunch of spam messages and build a Term Document Matrix (TDM). A TDM is a big grid where the row labels are the terms in the spam emails and the column headers are the spam emails. If row 21 is “viagra” and column 20 is the 20th spam email, then cell [21,20] displays the number of times the word “viagra” appeared in that email. Using the TDM they create a data frame which contains the terms and the percentage of documents in which a given term occurs. For example, of the 500 spam emails, the term “html” appears in 33% of the messages. That gets treated as a conditional probability: the probability of seeing the term “html” in an email, given the email is spam, is 0.33. The same process is carried out for emails that are not spam. So we have two data frames of terms (one for spam and one for not spam) and their estimated probabilities of occurrence.

To classify a new email as either spam or not spam we compare it to both data frames and see which terms in the email are in each. If a term is in the data frame, we pull out its probability; if not, we assign it a low probability of 0.000001. We multiply the probabilities together along with the prior probability of the message being spam, which they set to 0.5 for starters.

For example, say we have a message with 5 words in the body. None of the words appears in the spam data frame, so we calculate 0.5*0.000001^{5}. However, say two of the words appear in the good-email data frame, each with probability 0.05 of occurring. Then we calculate 0.5*0.05^{2}*0.000001^{3}. Since the latter is greater than the former, we classify the email as not spam. That’s how you do naive Bayes classification.
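In R the comparison is just:

# score if the message were spam: five words, none found in the spam data frame
spam.score <- 0.5 * 0.000001^5

# score if it were ham: two words found with probability 0.05, three not found
ham.score <- 0.5 * 0.05^2 * 0.000001^3

ham.score > spam.score # TRUE, so classify as not spam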

Now if you’re like me you’re probably thinking, where’s the denominator? I mean, I’m no Bayesian expert, but I do know Bayes’s theorem has a denominator. Well it turns out the denominator is the same in both calculations so they drop it. (I have to give a shout out to this guy’s explanation. That’s where I went to figure it out.) You’ll also notice we’re multiplying probabilities as if they’re independent. That’s the naive assumption here.

Using the language of probability, here’s Bayes’s Theorem for two terms, where A = the message is spam, B = some word, and C = some other word:

P(A|B \cap C ) = \frac{ P(B|A \cap C ) P(C|A) P(A) }{ P(B|C) P(C) }

For not spam this is…

P(A^{\prime} | B \cap C) = \frac{P(B|A^{\prime} \cap C)P(C|A^{\prime})P(A^{\prime})}{P(B|C)P(C)}

Notice the denominators are the same so we can drop them when comparing the two. Since we’re also assuming independence, the numerator changes to P(B|A)P(C|A)P(A) and P(B|A^{\prime})P(C|A^{\prime})P(A^{\prime}), respectively.

The book does not explain the classification algorithm in the detail I just did. I guess since the book is “for hackers” they didn’t want to get bogged down in math formulas and explanations of Bayes’s Theorem. I get that. But I still wanted to understand why it worked and not just how to do it in R.

The R code itself for this chapter is not too hard to understand. A few nuggets I got out of it are…

The readLines() function will return each line of text as a separate element of a character vector.
The dir() function will return a listing of all file names in a directory.
The intersect() function will return the values in common between two vectors.

 

Machine Learning for Hackers, Chapter 2

Chapter 2 of Machine Learning for Hackers is called Data Exploration. It explains means, medians, quartiles, variance, histograms, scatterplots, things like that. It’s a quick and effective introduction to some basic statistics. If you know stats pretty well you can probably skip it, but I found it pretty useful for its intro to the ggplot2 package.

The data they explore is a collection of heights and weights for males and females. You can download it here. The authors explore the data by creating histograms and scatterplots.

Here’s how they create a histogram of the heights:

ggplot(ch2, aes(x = Height)) + geom_histogram(binwidth=1)


Changing the binwidth parameter changes the size of the bins in inches. They also create kernel density estimates (KDE):

ggplot(ch2, aes(x = Height)) + geom_density()


Notice that line of code is the same as the histogram code but with a different function called after the “+”. Now that plots all the height data. They then show you how to create two different KDEs, one for each gender:

ggplot(ch2, aes(x = Height, fill=Gender)) + geom_density() + facet_grid(Gender ~ .)


Histograms are good for one variable. If you want to explore two numeric variables, you bust out a scatterplot. Continuing with ggplot, they do the following:

ggplot(ch2, aes(x = Height, y = Weight)) + geom_point()


There’s your scatterplot. Next they show how to easily run a smooth prediction line through the plot with a confidence band:

ggplot(ch2, aes(x = Height, y = Weight)) + geom_point() + geom_smooth()


Then they show you again how to distinguish between gender:

ggplot(ch2, aes(x = Height, y = Weight, color = Gender)) + geom_point()


Using this last scatterplot with genders called out, they decide to give a sneak preview of machine learning using a simple classification model. First they code the genders as Male=1 and Female=0, like so:

ch2 <- transform(ch2, Male = ifelse(Gender == 'Male', 1, 0))

Then they use glm to create a logit model that attempts to predict gender based on height and weight:

logit.model <- glm(Male ~ Height + Weight, data=ch2, family=binomial)

Finally they redraw the scatterplot, but this time use the logit model parameters to draw a "separating hyperplane" through the scatterplot:

ggplot(ch2, aes(x = Height, y = Weight, color = Gender)) + geom_point() +
 stat_abline(intercept = - coef(logit.model)[1] / coef(logit.model)[3],
 slope = - coef(logit.model)[2] / coef(logit.model)[3],
 geom = 'abline', color = 'black')


The formula for the line is y = - \frac{\alpha}{\beta_{2}} - \frac{\beta_{1}}{\beta_{2}} x, which comes from setting \alpha + \beta_{1} x + \beta_{2} y = 0 (the 50/50 decision boundary of the logit model) and solving for y. They don't tell you that in the book, but I thought I would state it for the record. Also the code in the book for this portion has typos and draws something different. What I provide above replicates Figure 2-31 in the book.

Again I felt like this was a solid intro to the ggplot package. And of course it's never bad to review the basics. "You have to be sound in the fundamentals" as sports analysts are fond of saying.

Machine Learning for Hackers, Chapter 1

I’ve started working through the book, Machine Learning for Hackers, and decided I would document what I learned or found interesting in each chapter. Chapter 1, titled “Using R”, gives an intro to R in the form of cleaning and prepping UFO data for analysis. I wouldn’t recommend this as a true introduction to R. It’s basically the authors preparing a dataset and stopping after each set of commands to explain what they did. I’ve been an R user for a few years now and I had to work a little to follow everything they did. No problem there. I’m always up for learning something new and long ago realized I will never know all there is to know about R. But I don’t think Chapter 1 of Machine Learning for Hackers is a good place to start if you’re new to R. Anyway, on to what I want to remember.

Number 1

The dataset they work with has two columns of dates in the format YYYYMMDD. They want to convert the dates to actual Date types, which they do with the as.Date function. But before they can do that, they need to identify dates that are less than or more than 8 characters long. Here’s how they do it:

good.rows <- ifelse(nchar(ufo$DateOccurred) != 8 | 
nchar(ufo$DateReported) != 8, FALSE, TRUE)

This creates a vector of TRUE/FALSE values that is the same length as the original UFO dataset. If either of the two dates has character length not equal to 8, FALSE is returned; otherwise TRUE. Next they use the vector of T/F values (called "good.rows") to subset the UFO data:

ufo <- ufo[good.rows,]

This keeps only those records (or rows) that have a corresponding TRUE value in the good.rows vector. So if row 25 in ufo has a TRUE value in "row" 25 of good.rows, that entire row is saved. The new ufo dataset will only have rows where both date values are exactly 8 characters long.

Number 2

They write a function to do some data cleaning. In the function they use gsub to remove leading white space from each character value, as follows:

gsub("^ ","",split.location)

The first argument specifies a leading space. That's what we're looking for. The second argument is nothing. That's what we're replacing the leading space with. The final argument is the vector of character values we're searching.

Number 3

They want to keep all UFO records from the United States. To do this they match the state abbreviation column against a pre-set list of all 50 state abbreviations. If a match is found, the state value stays the same and the record is kept; otherwise the state value is set to NA, which is eventually used to subset the dataframe for only US incidents. The neat thing I learned here is that R has a built-in vector of state abbreviations called state.abb.
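A minimal sketch of that kind of matching (my own illustration, not the book's exact code; I'm assuming the cleaned, lower-case state column is called USState):

# state.abb is the built-in vector of the 50 state abbreviations
us.states <- tolower(state.abb)

# keep the value if it matches a state abbreviation, otherwise set it to NA
ufo$USState <- ifelse(ufo$USState %in% us.states, ufo$USState, NA)

# then keep only the US incidents
ufo <- subset(ufo, !is.na(USState))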

Number 4

At one point they create a sequence of dates using the seq.Date function. Here is how to use it:

date.range <- seq.Date(from=as.Date("some earlier date"), 
to=as.Date("some later date"), by="month")

The by argument can also be years and days, and probably other things, too.

Number 5

Near the end they have a column of counts for number of sightings per month per year. For example, the number of sightings in Alaska in 1990-01. Sometimes there is a count, other times there is NA. They want to change the NA to 0. Here's how to do it:

all.sightings[,3][is.na(all.sightings[,3])] <- 0

That about does it. There are other good things in the first chapter, but these are the ones I really thought I could use and didn't want to forget. All the code for chapter 1 (and the other chapters) is available on github. Just beware the code you can download is different from the code in the book. However I think the code online is better. For example the book manually creates a list of all 50 state abbreviations, whereas the code on github uses the state.abb vector I mentioned in Number 3 above.

Handy R Function: expand.grid

In graduate school I took an experimental design class. As expected, we were assigned a group project that required us to design and analyze our own experiment. Two other students and I got together and decided to design an experiment involving paper airplanes. Not terribly useful, but it’s not as if we had a grant from the NSF. Anyway, we wanted to see what caused a plane to fly farther. Was it design? Was it paper weight? Was it both? We ended up running a 3 x 3 factorial experiment with the paper airplane throwers (us) as blocks.

We had three different paper airplane designs, three different paper weights, three people throwing, and replications. If memory serves, replications was three. In other words, each of us threw each of the 9 airplanes ( 3 designs x 3 paper weights) three times. That came out to 81 throws.

Now we had to randomly pick an airplane and throw it. How to randomize? Here’s a handy way to do that in R. First we use the expand.grid function. This creates a data frame from all combinations of  factors:

d <- expand.grid(throw=c(1,2,3), design=c(1,2,3), paper=c(20,22,28), person=c("Jack","Jane","Jill"))
head(d)
  throw design paper person
1     1      1    20   Jack
2     2      1    20   Jack
3     3      1    20   Jack
4     1      2    20   Jack
5     2      2    20   Jack
6     3      2    20   Jack

We see Jack will throw design 1 made of 20 lb. paper three times. Then he'll throw design 2 made of 20 lb. paper three times. And so on. Now to randomly sort the data frame:

d_rand <- d[sample(1:nrow(d)),]
d_rand
   throw design paper person
34     1      3    20   Jane
20     2      1    28   Jack
3      3      1    20   Jack
53     2      3    28   Jane
37     1      1    22   Jane
77     2      2    28   Jill

Very handy! And nifty, too.