# Machine Learning for Hackers, Chapter 12

In the final chapter the authors introduce Support Vector Machines (SVM). What SVM do are generalize linear decision boundaries for classification. Like the other chapters, this one avoids math and simply shows you how to use SVM in R. It then analyzes the spam mail dataset from chapter 3 using SVM, kNN and regularized logistic regression and compares their performance.

The first section of the chapter demonstrates SVM using a toy dataset. The dataset has two predictors (X and Y) and a response (a classification of 0 or 1). The dataset looks like this plotted:

I show this figure because the book is printed in black and white and you can’t tell class 0 from class 1 in the picture on page 276. That’s a problem that plagues this entire chapter. Anyway, what we have here is data that require a non-linear decision boundary. In other words we can’t draw a single straight line through the picture to indicate a boundary between the blue and red dots. This is where SVM comes in handy. Here’s the SVM code they use to analyze this dataset. As you’ll see they play around with the svm function from the e1071 package to show how changing certain parameters affects the quality of predictions. I didn’t get anything new out of this section that I felt compelled to blog about. It’s basically an extended example of how to use the svm function similar to what you might see on a help page. They do make use of the melt function from the reshape package, which is very cool and useful, but I already wrote about that in the Chapter 8 post.

The second section is “Comparing Algorithms”. They resurrect the spam dataset from chapter 3 and analyze it using SVM, kNN and regularized logistic regression. Recall the spam dataset was converted to a term document matrix (TDM) where the emails are the rows and the columns are terms. So email #1 would be row 1. In that row we would see the number of times the words in the column headers appeared. For example. the word “banner” appears o times in email #1. Column 1 of the spam dataset indicates whether or not the email is spam. They create testing and training sets and use those to create models and compare the different classification methods. Again, there’s nothing too fancy here where R programming is concerned, or at least nothing I haven’t already covered. They determine logistic regression works best for classifying spam email.

Now here’s something interesting worth noting. Instead of prepping the spam data as they did in Chapter 3, they skip directly to loading it into R as follows:

load('data/dtm.RData')

But how did they save that data as dtm.Rdata? By doing something like this:

save(dtm, file="dtm.RData")

…where dtm was a matrix. I say “matrix” because that’s what dtm.RData is when it gets loaded.

Another snippet worth noting is how they create their training and test sets. First they randomly select numbers as the indices for selecting the training data from the number of rows from the spam matrix. Then they take whatever numbers were not selected and make those the indices for the test data:

training.indices <- sort(sample(1:nrow(dtm), round(0.5 * nrow(dtm))))
test.indices <- which(! 1:nrow(dtm) %in% training.indices)

The dtm dataset has 3249 rows. So the sample function selects (without replacement) from the numbers 1 through 3249. The second argument to the sample function says how many to sample, which in this case is $0.5 \times 3249 = 1624.5$. That gets rounded to 1624 by the round function. So 1624 numbers are sampled without replacement from the numbers 1 through 3249. And then they get sorted in ascending order. Those are the indices for the training data. To get the test data, they use the value matching %in% operator to look for matches, but negate the matches to get all the numbers that were not selected with the sample function.

The code "1:nrow(dtm) %in% training.indices" returns a logical vector with 3249 elements. 1624 are TRUE, the rest are FALSE. The code "! 1:nrow(dtm) %in% training.indices" returns the opposite logical vector. Now 1624 are FALSE and the rest are TRUE. Finally, inserting that strip of code into the which function returns the indices of the vector that are FALSE. Therefore both training.indices and test.indices are vectors of numbers from 1:3249, with no overlap between them. For example:

> training.indices[1:10]
[1] 1 2 6 9 11 12 15 16 18 19
> test.indices[1:10]
[1] 3 4 5 7 8 10 13 14 17 24

These vectors of numbers are then used to select the training and test data as follows:

train.x <- dtm[training.indices, 3:ncol(dtm)]
train.y <- dtm[training.indices, 1]
test.x <- dtm[test.indices, 3:ncol(dtm)]
test.y <- dtm[test.indices, 1]

I thought that was good to go through and understand since all the functions in this section require data (or are capable of taking data) in this format. For example:

library('glmnet')
regularized.logit.fit <- glmnet(train.x, train.y, family = c('binomial'))
library('e1071')
linear.svm.fit <- svm(train.x, train.y, kernel = 'linear')
library('class')
knn.fit <- knn(train.x, test.x, train.y, k = 50)

And that does it for me! I'm ready to move on to another book.

# How to run R in batch mode in Windows 7

I recently purchased The Art of R Programming with a gift card I got for Christmas. Three pages into chapter 1 it talks about running R in “batch mode”. I had never done this and thought I would give it a try. It seems very useful, especially for large programs. Here’s the example program the book provides:

pdf("xh.pdf")
hist(rnorm(100))
dev.off()

You type that into a text file and save it as z.R. This program produces a histogram of 100 values drawn from a standard Normal distribution and outputs the histogram to a PDF file called xh.pdf. Now we want to submit that program to R without actually opening R. That means using the command line. But first we have to update our path if we’re using Windows. The book doesn’t mention this, hence the reason for this post. Here’s how you do it if you’re running 64-bit R on Windows 7:

1. go to Control Panel
2. click User Accounts and Family Safety
3. click User Accounts
4. click “Change my environment variables” in the left menu
5. Highlight Path and click Edit
6. Add semicolon at the end and then add “c:\Program Files\R\R-2.15.1\bin\x64” (without the quotes; update version if different from 2.15.1).
7. Click OK twice.
8. Log out and log back in to have the new path settings take effect.

Now go to the command prompt. Just go to Start and type “cmd”. Next go to the directory where your R program is located. If it’s in Documents, type “cd Documents”. Finally submit the following command:

R CMD BATCH z.R

That runs the R program and creates a PDF file in your working directory.

# Machine Learning for Hackers, Chapter 11

Are you a heavy Twitter user? Would you like to build a custom recommendation system in R that will recommend new Twitter feeds for you to follow? If you answered “yes” to both questions then this is the chapter for you. Here’s the code. Have fun.

Now I don’t use Twitter, so this chapter was not very interesting for me. I should probably use Twitter and get with it and join the 2010’s. But at the moment I don’t, so building a Twitter recommendation system in R doesn’t do me any good. I guess I could have followed along and built the recommendation system in the book for one of the authors, but I just couldn’t motivate. It’s a lot of code that requires new packages. Plus there’s a section that requires using a program outside of R called Gephi. It’s an extremely specific case study. It’s not like some of the other chapters where they introduce a general machine learning method, like kNN or PCA, and show an example of how to use it. There is nothing general about this chapter. At one point they do make use of hierarchical clustering to find similarities between Twitter users. Clustering is a common machine learning tactic. But they only give it a couple of pages. I hate to sound down on the chapter. It’s well done and the authors put a lot of work into it. Personally, though, I just couldn’t get excited about it.

Having said that, I did see some interesting things in the R code that I wanted to file away and remember. The first was creating an empty vector. In a loop they run, they sometimes get a null result and want to store an empty vector for that pass. They do this as follows:

y <- c(integer(0), integer(0))

What caught my eye was the way rbind handles those vectors. It ignores them. I could see that being potentially useful. Here's a toy demonstration:

x <- c(5, 4)
y <- c(integer(0), integer(0))
z <- c(3, 2)
rbind(x,y,z)
[,1] [,2]
x     5    4
z     3    2

Another snippet I wanted to note was how to identify unique values in a matrix. You use the unique function, which I was familiar with. What I didn't know was that you can do it for a matrix by calling the columns of the matrix, like this:

mat <- matrix(floor(runif(10,2,8)),5,2)
mat
[,1] [,2]
[1,]        4      3
[2,]        2      7
[3,]        7      2
[4,]        3      7
[5,]        5      7
> unique(c(mat[,1],mat[,2]))
[1] 4 2 7 3 5

This chapter also used a base function that I had never seen before called duplicated. It returns a vector of True/False values that tells you whether or not the value occurs previously in the vector. I was surprised I hadn't heard of it before. Here's a demo of duplicated:

x <- c(1:5,2:6)
x
[1] 1 2 3 4 5 2 3 4 5 6
duplicated(x)
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
!duplicated(x)
[1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
x[duplicated(x)]
[1] 2 3 4 5
x[!duplicated(x)]
[1] 1 2 3 4 5 6

Finally they make use of sapply in one of their functions and I wanted to document its use. The suite of apply functions always give me trouble for some reason. I get the gist of what they do, I just have a hard time remember why I would want to use one over the other. Anyway, what they do is use sapply to run a function over each row of a matrix. That seems like a useful thing to remember. First let's create a 100 x 3 data frame of values:

a <- rnorm(100,2,4)
b <- rnorm(100,3,4)
c <- rnorm(100,5,4)
df <- data.frame(a,b,c)

Now, let's say I wanted to create a True/False vector indicating which rows contained a value less than 0. Here's how I can do that using sapply:

sapply(1:nrow(df), function(i) ifelse(any(df[i,]<0),1,0))

The authors did something similar to that and then used the vector of T/F values to subset another vector. Of course if I wanted to subset my example data frame above to show only rows that contain a value less than o, then I could do this:

subset(df, a < 0 | b <0 | c < 0)

So that's the helpful R code I gleaned from Chapter 11. I feel like there should be more. It's a long chapter with a lot of code. But much of it either involves functions from specific packages or uses common R functions I've already documented or know about. Next up is chapter 12, the final chapter.

# Machine Learning for Hackers, Chapter 10

k-nearest neighbors (kNN) is the topic of chapter 10. The authors say it’s the most intuitive of the machine learning algorithms covered in the book and I guess I would agree, provided you understand the euclidean distance formula, understand what a distance matrix is, Â and know a little about how R works.

They provide a toy dataset (named “df”) to demonstrate how kNN works. It looks like this:

         X        Y Label
1 2.373546 5.398106     0
2 3.183643 4.387974     0
3 2.164371 5.341120     0
4 4.595281 3.870637     0
5 3.329508 6.433024     0
6 2.179532 6.980400     0

Think of the rows as points lying in an XY plane. Notice they also have labels. The dataset has 100 rows, with 50 labeled 0 and 50 labeled 1. To run the kNN algorithm, we need to create a distance matrix that stores the distance between all the points. So row 1 of the distance matrix would have the distances of all points from the first point (2.373546, 5.398106). Row 2 would have the distances of all points from (3.183643, 4.387974). And so on. The authors actually write a function to create the distance matrix, which is good for illustrating how the distance matrix is created, but I want to note you can use the built-in “dist” function for this:

distance <- as.matrix(dist(df, diag=TRUE, upper=TRUE))

I set the "diag" and "upper" parameters to TRUE in order to match the matrix they create. The result is this:

 distance[1:3,1:3]
1        2         3
1 0.0000000 1.294845 0.2167983
2 1.2948454 0.000000 1.3954937
3 0.2167983 1.395494 0.0000000

That's just a portion of the matrix. The full matrix is 100 x 100. The first row (or first column) refers to point 1 Â (2.373546, 5.398106). distance[1,1] is 0 because that's how far point 1 is from itself. distance[1,2] is 1.294845 because that's how far point 2Â (3.183643, 4.387974) is from point 1. The distance between two points A($x_{1},y_{1}$) and B($x_{2},y_{2}$) is calculated using the Euclidean distance formula:

$AB = \sqrt{(x_{1}-x_{2})^{2} + (y_{1}-y_{2})^{2}}$ $1.294845 = \sqrt{(2.373546-3.183643)^{2} + (5.398106-4.387974)^{2}}$

Once we have our distance matrix, we can find the k-nearest neighbors for a particular point by sorting the row from smallest to greatest distance. We can use the order function to help us do this. For example, let's find the 5 nearest neighbors for the first point:

indices <- order(distance[1,])[2:6]
indices
[1] 46 3 36 13 29
distance[1,indices]
Â  Â  Â  Â 46   Â  Â  Â  3        36        13        29
0.1796931 0.2167983 0.2212696 0.2916801 0.3561144

The order function returns the index positions of the nearest neighbors in the distance matrix. The index position can also be thought of as the identifying point number. So the closest neighbor of point 1 is point 46. The second-closest neighbor of point 1 is point 3. The command "distance[1,indices]" shows us the 5 closest points and their distances. What are their labels?

df[indices,'Label']
[1] 0 0 0 0 0

They're all 0's. If we wanted to predict the label of point 1, we could use majority rule. In this case since the 5-nearest neighbors are all labeled "0", we would predict point 1 is also labeled "0". (In truth, it is.) The authors carry this out for the entire dataset of 100 points and correctly predict the label for 93 of them given their 5-nearest neighbors.

The case study in this chapter is recommending R packages. The dataset is a list of 2487 packages, listed 50 times over for 50 programmers, with an indicator of whether or not that package is installed for the programmer. Using kNN they build a list of 25 recommendations for each programmer. Here's the code.

Next up is Chapter 11, Analyzing Social Graphs. It looks like the case study will involve building an engine to recommend people to follow on Twitter. I guess I'm a poopy stick-in-the-mud because this doesn't appeal to me. I'm not really into the Twitter. But I'll keep an open mind, slog through it and learn as much as I can.

# Machine Learning for Hackers, Chapter 9

In this chapter we learn a little about Multidimensional Scaling, or MDS for short. Like PCA in Chapter 8, MDS is a topic usually taught in multivariate statistics classes. I guess it’s classified as machine learning because of the computing required to do it, though the same can be said of almost any statistical method. The MDS that is presented is the classical type (as opposed to the non-metric type). The case study is exploring similarity in the US Senate based on votes. As it turns out, a proper MDS analysis shows Democrats and Republicans indeed vote differently and are quite polarized. Not surprising but better than anecdotal evidence from some political pundit.

Now if you’ve been following my series of posts on this book (and I’m pretty sure you haven’t) then you know I don’t summarize the chapter but rather highlight the R code that I found interesting and new. But for the first time I really don’t have much to highlight. It’s a short chapter and the data for the case study required very little preparation. I was actually able to follow all the R code without running to the help pages and Google. Still, though, I feel like I should document something from this chapter.

How about reading in a bunch of files. I feel like this something I will forget how to do one day, so I should probably write about it to help form some memory. First they define a variable that stores a path and then they call the list.files function to create a vector of file names:

data.dir <- "data/roll_call/"
data.files <- list.files(data.dir)

Next they use the lapply function to create one big list object that contains the contents of all those files, which happen to be Stata data sets:

rollcall.data <- lapply(data.files, function(f)
read.dta(paste(data.dir, f, sep=""), convert.factors=FALSE))

So the lapply function applies the read.dta function to each element in the data.files vector, which is file name. But in the read.dta function you'll notice the paste function is called to create a full path with the file name. Just dandy. That never fails to impress.

If you want to inspect some of the data, you can't just do head(rollcall.data). Since it's a list object you have to do something like this:

head(rollcall.data[[1]])

That shows the first 6 rows of the data stored in the first list element.

Later on they use the strsplit function to split first and last names of senators by comma. I'm pretty sure I've posted about this before, but it doesn't hurt to mention it again. Running a vector through strplit returns a list. For example:

 strsplit(congress$name[1], "[, ]") [[1]] [1] "SHELBY" "" "RIC" If I just want the last name, I can do this:  strsplit(congress$name[1], "[, ]")[[1]][1]
[1] "SHELBY"

Now the authors use the strsplit function in an sapply function, like this:

congress.names <- sapply(congress$name, function(n) strsplit(n, "[, ]")[[1]][1]) That gives us this:  head(congress.names) SHELBY, RIC HEFLIN, HOW STEVENS, TH MURKOWSKI, DECONCINI, MCCAIN, JOH "SHELBY" "HEFLIN" "STEVENS" "MURKOWSKI" "DECONCINI" "MCCAIN" Notice each element in the vector has a "name". Ugly, I think, though it doesn't matter in the context of what they do. But what if you wanted a clean vector of last names without the "names"? You can do this: congress.names <- sapply(congress$name, function(n)
strsplit(n, "[, ]")[[1]][1],USE.NAMES = FALSE)
[1] "SHELBY"    "HEFLIN"    "STEVENS"   "MURKOWSKI" "DECONCINI" "MCCAIN"

Much better. Notice the USE.NAMES = FALSE setting. Pretty self-explanatory what it does.

If you want to see the rest of their code including how they carry out the MDS analysis and MDS clustering plots, see their git hub page. For more on how exactly MDS works, see Chapter 14 of A Handbook of Statistical Analyses Using R.

# Machine Learning for Hackers, Chapter 8

This is a tidy chapter on principle component analysis (PCA). There are many fine tutorials on PCA to be found on the internet. This one is special because it doesn’t start with a nice data set. They have to first beat it into shape. They’re goal is to develop a market index based on 25 stocks. They will do PCA and use the first component to develop an index.

After they read in a CSV file with the date, they have to translate a column of dates into a date object. They use a package called “lubridate” for this, but it seems the as.Date function works just as well. This is what I did:

prices <- read.csv('data/stock_prices.csv')
prices <- transform(prices, Date = as.Date(Date))

Next they have to reshape the data. The prices dataset looks like this:

 head(prices)
Date     Stock Close
1 2011-05-25 DTE 51.12
2 2011-05-24 DTE 51.51
3 2011-05-23 DTE 51.47
4 2011-05-20 DTE 51.90
5 2011-05-19 DTE 51.91
6 2011-05-18 DTE 51.68

They want to transform this so the column headers are the Date and all the Stocks. They use the cast function from the reshape package.

library(reshape)
date.stock.matrix <- cast(prices, Date ~ Stock, value = 'Close')

The result:

        Date   ADC   AFL ARKR AZPN  CLFD   DDR   DTE  ENDP FLWS     FR GMXR
1 2002-01-02 17.70 23.78 8.15 17.10 3.19 18.80 42.37 11.54 15.77 31.16 4.50
2 2002-01-03 16.14 23.52 8.15 17.41 3.27 18.55 42.14 11.48 17.40 31.45 4.37
3 2002-01-04 15.45 23.92 7.79 17.90 3.28 18.46 41.79 11.60 17.11 31.46 4.45
4 2002-01-07 16.59 23.12 7.79 17.49 3.50 18.30 41.48 11.90 17.38 31.10 4.38
5 2002-01-08 16.76 25.54 7.35 17.89 4.24 18.57 40.69 12.41 14.62 31.40 4.30
6 2002-01-09 16.78 26.30 7.40 18.25 4.25 18.41 40.81 12.27 14.27 31.05 4.25

Now after they do this they mention the new dataset has missing entries (NAs) that need to be removed. But they don't tell you how they found them. Furthermore they proceed to not just remove a row of NAs, but an entire column of a stock because it had a value on a day when all other stocks were NA! Here's what they do:

# remove a row and a column
prices <- subset(prices, Date != "2002-02-01")
prices <- subset(prices, Stock != "DDR")
date.stock.matrix <- cast(prices, Date ~ Stock, value = 'Close')

That reduces the data set to 2366 rows and 25 columns. Instead of 25 stocks, we now have 24. (Remember, the first column is Date.) What confused me is why they didn't do something like this:

date.stock.matrix <- na.omit(date.stock.matrix)

That reduces the data set by two rows to 2365 rows but keeps all the columns. That means you have 25 stocks to develop your index instead of 24. But let's back up for a moment and think about how they found those missing values in the first place. They don't tell you in the book. They just say that they notice they have missing entries and need to remove them. Here's one way you could find out if you didn't know whether or not there were missing values. First get the dimensions of your data frame:

dim(date.stock.matrix)
[1] 2367 26

Now run the following to see if the dimensions change:

dim(date.stock.matrix[complete.cases(date.stock.matrix),])
[1] 2365 26

They do! Two rows have missing values. But which two rows? Try this

row.has.na <- apply(date.stock.matrix, 1, function(x){any(is.na(x))})
date.stock.matrix[row.has.na,]

That returns the two rows in the data frame with missing values, in this case rows 22 and 875. Again, we can use the na.omit code above to clear out those rows. I'm not sure why the authors didn't do this. When they run their PCA on their final data set with 24 stocks, the first component accounts for 29% of the variance. But when I run it on the data set with all 25 stocks and only two rows removed, the first component accounts for almost 33% of the variance. Not a huge improvement, but still improvement.

A couple of other "that was cool" nuggets I took from this chapter were the way they looked at the correlation matrix (to get a feel if PCA was warranted) and their use of the melt function from the reshape package. First the correlation trick:

cor.matrix <- cor(date.stock.matrix[,2:ncol(date.stock.matrix)])
corrs <- as.numeric(cor.matrix)
library(ggplot2)
ggplot(data.frame(Correlation = corrs), aes(x = Correlation, fill = 1))
+ geom_density() + opts(legend.position = 'none')

They calculate the correlation matrix of the closing values, convert the correlation matrix to a vector, and then make a density plot. This visually shows you that most of the correlations are positive and that PCA doesn't seem out of the question. I thought that was a neat trick.

Now the melt function. They use this function to convert a data set with both their market index data and the DJI such that both can be plotted over time. Let's look at before and after and see what it does.

Before melt:

 head(comparison)
Dates MarketIndex       DJI
1 2002-01-02 -1.0346886 -0.3251888
2 2002-01-03 -1.0265119 -0.2602872
3 2002-01-04 -1.0199864 -0.2027079
4 2002-01-07 -1.0230787 -0.2439139
5 2002-01-08 -0.9953525 -0.2744783
6 2002-01-09 -0.9873846 -0.3115893

After melt:

alt.comparison <- melt(comparison, id.vars='Dates')
names(alt.comparison) <- c('Dates', 'Index', 'Price')
 head(alt.comparison)
Dates       Index      Price
1 2002-01-02 MarketIndex -1.0346886
2 2002-01-03 MarketIndex -1.0265119
3 2002-01-04 MarketIndex -1.0199864
4 2002-01-07 MarketIndex -1.0230787
5 2002-01-08 MarketIndex -0.9953525
6 2002-01-09 MarketIndex -0.9873846

So what they want to do is plot the indices over time and put them both on the same graph. In the first data set, each row is a date with both index values on that date. The melt function says to identify the other two columns by date. Each column name in the first data set becomes a grouping factor in the second data set. The values that are in columns 2 and 3 of the first data set are now all in column three in the second data set. It's hard to explain in words but makes sense if you stare at the before-and-after long enough.

# Machine Learning for Hackers, Chapter 7

The title of this chapter, “Optimization: Breaking Codes”, sounds more exciting than it really is. The code breaking is attempted via an optimization algorithm called the Metropolis method. The authors show you how to encrypt text using a simple Caesar cipher and then how to implement the Metropolis method to attempt to decipher the message. Their test message is “here is some sample text”. They encrypt it using the Caesar cipher, so it looks like “ifsf jt tpnf tbnqmf ufyu “. The encryption rule is simply replacing every letter with the following letter in the alphabet. The letter a is replaced with b, b with c, and so on until you replace z with a. They manage to decipher their test text through the course of 50,000 Metropolis steps (on the 45,609th attempt). I suppose this is a fun demonstration of optimization. It just seemed a little outdated. I guess I was expecting to see them crack an actual message from World War II or something.

They start the chapter by demonstrating how to use the R optim function. For some reason this function has always confused me. Probably because I don’t use it that often. But I have to say I feel much better about it now. They give clear demonstrations of its use and visually show you how it works. Their example involves the height and weight dataset from chapter 2. First we do linear regression and find coefficients for the slope and intercept. Then we find the same coefficients using optim. The function minimized in the call to optim is a hand-built sum of squared errors function. I thought this was a great way to demonstrate how optim works. But after this, they do something puzzling: they continue to demonstrate how to use optim with Ridge Regression. Now Ridge Regression is a type of regression that helps remedy multicollinearity. If you have predictor variables that are highly correlated, you may get some weird coefficients in your linear modelling results. I’m talking about getting a negative value where you would expect a positive value, that sort of thing. Ridge regression helps address this by “modifying the method of least squares to allow biased estimators of the regression coefficients” (Kutner, et al., p. 432). Knowing that, it doesn’t make much sense to demonstrate Ridge regression on height and weight data. There’s only one predictor. I thought maybe they were going to do more with Ridge regression in this chapter, but they don’t. They move on to code breaking. It just seemed out of place to bring up Ridge regression in the context of single predictor regression.

I won’t go into the code breaking program. The code is available on the book’s git hub page. What I wanted to review for my own benefit is their function for encrypting text using the Caesar cipher. I thought it was a good example of using R’s indexing capabilities with list objects. First we need a vector with the English alphabet and two empty list objects for the ciphers: one for encrypting text and another for deciphering text:

english.letters <- letters
caesar.cipher <- list()
inverse.caesar.cipher <- list()

Now we create the ciphers:

# need to use index %% 26 + 1 to handle letter "z" at end
# 26 %% 26 + 1 = 0 + 1 = 1
for (index in 1:length(english.letters))
{
caesar.cipher[[english.letters[index]]] <- english.letters[index %% 26 + 1]
inverse.caesar.cipher[[english.letters[index %% 26 + 1]]] <- english.letters[index]
}

The "index %% 26 + 1" confused me for a couple of reasons. First off, the %% operator means modulus and I'm used to see something like 12 %% 5, where the first number is bigger than the second. The answer to that is of course 2. If you divide 12 by 5 you get remainder 2, and that's what the %% operator returns. But in the code above, you'll notice the "index" value is less than 26 on every iteration except the last. How do you calculate something like 2 %% 26? I wasn't sure so I had to google around and figure it out. It turns out the modulus operation a %% n is defined as follows:

$a \, mod \, n = a - \lfloor \frac{a}{n} \rfloor \times n$

So, 2 %% 26 is calculated like so:

$2 \, mod \, 26 = 2 - \lfloor \frac{2}{26} \rfloor \times 26 = 2 - \lfloor 0.0769 \rfloor \times 26 = 2 - (0 \times 26) = 2$

Once I was clear on that, I was confused why we needed to use the %% operator at all. Again if you look at the code above, you'll see "index %% 26 + 1". When index = 2, "index %% 26" gives 2. When index = 3, "index %% 26" gives 3. And so on. Why not just use "index + 1"? Well, when we get to the letter "z", we need to substitute "a". The index number for "a" is 1. But if we do "index + 1", then we would get 27 at the letter "z" since its index number is 26. That would not index anything in the 26 element "english.letters" vector and produce an error. Therefore we need to use the %% operator to allows us to recycle through the numbers 1 - 26. When index = 26 (i.e., when we get to the letter "z"), we do 26 %% 26 + 1, which equals 1 since 26/26 has remainder equal 0.

Now that I understand how the ciphers are generated, I can print them and have a look if I like:

print(caesar.cipher) # encrypt text
print(inverse.caesar.cipher) # decipher text

Since these are lists, we investigate with indices using double brackets. For example, to see what we substitute for letter "h", we do the following:

caesar.cipher[["h"]]

That returns "i". It seems like a fun extension to this would be to create a general function that allowed you to specify the substitution index. In the Caesar cipher, this is 1. It would be neat to specify, say, 13 so "a" becomes "n", "b" becomes "o", and so on. Anyway, moving on...

Now we use the ciphers we just made to encrypt text. To do this the authors build two functions. The first encrypts text letter-by-letter using the cipher you give it. The second function encrypts words using the first function. Let's have a look at these:

apply.cipher.to.string <- function(string, cipher)
{
output <- ''
for (i in 1:nchar(string))
{
output <- paste(output, cipher[[substr(string, i, i)]], sep='')
}
return(output)
}
apply.cipher.to.text <- function(text, cipher)
{
output <- c()
for (string in text)
{
output <- c(output, apply.cipher.to.string(string, cipher))
}
return(output)
}

The first function loops through each letter of a word (or "string"). Using the substr function, we extract each letter, use that letter to look up the substitution in the cipher, and "paste" the substituted letter to the previously substituted letter. At the conclusion of the loop we have encrypted the word using our cipher. The second function uses this first function to encrypt whole phrases. Here is how the authors encrypt the message "here is some sample text" using the Caesar cipher:

apply.cipher.to.text(c('here', 'is', 'some', 'sample','text'), caesar.cipher)

That gives us "ifsf" "jt" "tpnf" "tbnqmf" "ufyu" . If we wanted to decipher the text, we would use the inverse cipher as follows:

apply.cipher.to.text(c('ifsf', 'jt', 'tpnf', 'tbnqmf', 'ufyu'), inverse.caesar.cipher)

Again it would be a good exercise to generalize these functions to accept a whole text file and return an encrypted text file. Not that I'm going to do that, but it sounds like it would make a good project assignment in some programming class.

# Machine Learning for Hackers, Chapter 6

Chapter 6 covers orthogonal polynomial regression, cross-validation, regularization and a touch of logistic regression. The case study is predicting popularity of O’Reilly’s top 100 books based on their back cover descriptions. As usual I picked up some useful tricks from this chapter. Let’s get into it.

If you have a data frame called “df” with a variable $x$ and you want to add, say, $x^{2}$ to it (and name it “X2”), you can do the following:

df <- transform(df, X2 = X ^ 2)

Now it turns out R has a function called poly that does this for you:

poly(X , degree=2, raw=TRUE)

So you could leave your data frame as is (say it has two columns, Y and X) and do polynomial regression like so:

lm(Y ~ poly(X, degree=2, raw=TRUE), data = df)

That's straight-up raw polynomial regression. If you leave out the "raw = TRUE" parameter (or set it to FALSE, its default value), the poly function generates orthogonal polynomials. As the book states: "instead of naively adding simple powers of x, we add more complicated variants of x that work like successive powers of x, but aren't correlated with each other in the way that $x$ and $x^{2}$ are." Two vectors are orthogonal when their dot product equals 0. You can verify the poly produces orthogonal vectors. Here's a simple example:

x <- 1:10
p <- poly(x, 2)
p[,1] %*% p[,2]

That returns 0. I don't understand how R creates these orthogonal vectors. It returns a couple of items called "alpha" and "norm2" which I'm pretty sure have something to do with it. But the help pages and Google searches yielded nothing. (Though I will say this guy's blog post was useful.) Anyway, they illustrate polynomial regression using some toy data:

set.seed(1)
x <- seq(0, 1, by = 0.01)
y <- sin(2 * pi * x) + rnorm(length(x), 0, 0.1)
df <- data.frame(X = x, Y = y)
# raw polynomial regression
m1 <- lm(Y ~ poly(X,degree=5,raw=TRUE), data=df)
# orthogonal polynomial regression
m2 <- lm(Y ~ poly(X,degree=5), data=df)

If you want to make predictions using your models, do something like this:

d <- seq(0,1,by = 0.05)
p1 <- predict(m1, data.frame(X=d))
p2 <- predict(m2, data.frame(X=d))

If you want to see how your model visually performs, do something like this:

plot(x, y, xlab = "X", ylab = "Y", las = 1, xlim = c(0, 1))
lines(d, p1, col = "red")
lines(d, p2, col = "blue")

Finally if you want to update your orthogonal model, you can do something like this:

m1.2 <- lm(Y ~ poly(X,degree=5,raw=TRUE)[,-4], data=df)
m1.3 <- lm(Y ~ poly(X,degree=5,raw=TRUE)[,c(-2,-4)], data=df)

Now they only play around with orthogonal regression for the most part and they use cross validation. They present a general way to randomly split data into a testing and training sets and then loop through polynomial degrees evaluating RMSE at each step for both the training and testing sets. See their 14th and 15th code snippets for this code. I think it's quite useful.

After this they go into Regularization, which is some heavy-duty black box stuff. I guess they do a good job of showing you how to do it using the "glmnet" package. Still, though, I have no idea what's going on under the hood. I would feel somewhat disingenuous using this for real. I have no doubt it works. I'm not questioning their application or the glmnet package. I just feel uneasy using methods that I don't understand. I'm always afraid someone will ask me to explain how it works.

Finally they use Regularization and logistic regression to build a model for predicting whether or not a book will be in the top 50 (of the 100 best selling books) given its back-cover description. Instead of summarizing that analysis I wanted to write a little about how they hammer the data into shape to analyze it. First they read in the data, then they load the "tm" package and start calling functions:

ranks <- read.csv('data/oreilly.csv', stringsAsFactors = FALSE)
library('tm')
documents <- data.frame(Text = ranks\$Long.Desc.)
row.names(documents) <- 1:nrow(documents) # not sure why this is necessary
corpus <- Corpus(DataframeSource(documents))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords('English'))

The last four lines is where the magic happens. First they make a "corpus". This is sort of like a list where the elements are text documents. The next line makes everything lower case. The next collapses multiple white space characters to a single blank. The last removes the most common words in English, like "the" and "an". When you're done, you can't view the corpus like you would a data frame. You can't just type the name at the command line and hit enter. You have to use the inspect() function, like this:

inspect(corpus) # all documents
corpus[[2]] # second document

Pretty cool. But it's still not ready for regression analysis. You have to produce a Document Term Matrix (DTM) from your corpus. This is done like so:

dtm <- DocumentTermMatrix(corpus)

Easy enough. But, again, how to view it? Use the inspect() function, like this:

# first five documents and first five terms
inspect(dtm[1:3,1:5])
# all documents and first five terms
inspect(dtm[,1:5])

The rows indicate the document (in this case numbered 1 - 100) and the column headers are the words (or terms). The values in the matrix tell you how many times the term appeared in a document. Here's a quick way to do some analysis on those terms:

# find those terms that occur at least five times
findFreqTerms(dtm, 5)

The last step to get everything ready for analysis is to convert the DTM into a matrix:

x <- as.matrix(dtm)

We do this because the glmnet function requires the predictors to be in a matrix, and the DTM represents the predictors. (The response is an indicator: top-50 book or not.) Now the dimension of the dtm matrix is 100 x 5605! That's a lot of columns! The authors have this to say: "Because our data set has more columns than rows, unregularized regression will always produce an overfit model. For that reason, we have to use some form of regularization to get any meaningful results." They probably could have eliminated some of the columns. When I inspected the DTM I saw terms such as "!)</li", "#</--", "#</,", "#</em", "#</em," and "&amp;". These are clearly just left over HTML codes. I can't see how they contribute to this analysis. But even if they were removed the matrix would still have many more columns than rows.