Machine Learning for Hackers, Chapter 8

This is a tidy chapter on principle component analysis (PCA). There are many fine tutorials on PCA to be found on the internet. This one is special because it doesn’t start with a nice data set. They have to first beat it into shape. They’re goal is to develop a market index based on 25 stocks. They will do PCA and use the first component to develop an index.

After they read in a CSV file with the date, they have to translate a column of dates into a date object. They use a package called “lubridate” for this, but it seems the as.Date function works just as well. This is what I did:

prices <- read.csv('data/stock_prices.csv')
prices <- transform(prices, Date = as.Date(Date))


Next they have to reshape the data. The prices dataset looks like this:

  Date     Stock Close
1 2011-05-25 DTE 51.12
2 2011-05-24 DTE 51.51
3 2011-05-23 DTE 51.47
4 2011-05-20 DTE 51.90
5 2011-05-19 DTE 51.91
6 2011-05-18 DTE 51.68


They want to transform this so the column headers are the Date and all the Stocks. They use the cast function from the reshape package.

date.stock.matrix <- cast(prices, Date ~ Stock, value = 'Close')


The result:

1 2002-01-02 17.70 23.78 8.15 17.10 3.19 18.80 42.37 11.54 15.77 31.16 4.50
2 2002-01-03 16.14 23.52 8.15 17.41 3.27 18.55 42.14 11.48 17.40 31.45 4.37
3 2002-01-04 15.45 23.92 7.79 17.90 3.28 18.46 41.79 11.60 17.11 31.46 4.45
4 2002-01-07 16.59 23.12 7.79 17.49 3.50 18.30 41.48 11.90 17.38 31.10 4.38
5 2002-01-08 16.76 25.54 7.35 17.89 4.24 18.57 40.69 12.41 14.62 31.40 4.30
6 2002-01-09 16.78 26.30 7.40 18.25 4.25 18.41 40.81 12.27 14.27 31.05 4.25


Now after they do this they mention the new dataset has missing entries (NAs) that need to be removed. But they don't tell you how they found them. Furthermore they proceed to not just remove a row of NAs, but an entire column of a stock because it had a value on a day when all other stocks were NA! Here's what they do:

# remove a row and a column
prices <- subset(prices, Date != "2002-02-01")
prices <- subset(prices, Stock != "DDR")
date.stock.matrix <- cast(prices, Date ~ Stock, value = 'Close')


That reduces the data set to 2366 rows and 25 columns. Instead of 25 stocks, we now have 24. (Remember, the first column is Date.) What confused me is why they didn't do something like this:

date.stock.matrix <- na.omit(date.stock.matrix)


That reduces the data set by two rows to 2365 rows but keeps all the columns. That means you have 25 stocks to develop your index instead of 24. But let's back up for a moment and think about how they found those missing values in the first place. They don't tell you in the book. They just say that they notice they have missing entries and need to remove them. Here's one way you could find out if you didn't know whether or not there were missing values. First get the dimensions of your data frame:

[1] 2367 26


Now run the following to see if the dimensions change:

[1] 2365 26


They do! Two rows have missing values. But which two rows? Try this <- apply(date.stock.matrix, 1, function(x){any(})


That returns the two rows in the data frame with missing values, in this case rows 22 and 875. Again, we can use the na.omit code above to clear out those rows. I'm not sure why the authors didn't do this. When they run their PCA on their final data set with 24 stocks, the first component accounts for 29% of the variance. But when I run it on the data set with all 25 stocks and only two rows removed, the first component accounts for almost 33% of the variance. Not a huge improvement, but still improvement.

A couple of other "that was cool" nuggets I took from this chapter were the way they looked at the correlation matrix (to get a feel if PCA was warranted) and their use of the melt function from the reshape package. First the correlation trick:

cor.matrix <- cor(date.stock.matrix[,2:ncol(date.stock.matrix)])
corrs <- as.numeric(cor.matrix)
ggplot(data.frame(Correlation = corrs), aes(x = Correlation, fill = 1))
 + geom_density() + opts(legend.position = 'none')


They calculate the correlation matrix of the closing values, convert the correlation matrix to a vector, and then make a density plot. This visually shows you that most of the correlations are positive and that PCA doesn't seem out of the question. I thought that was a neat trick.

Now the melt function. They use this function to convert a data set with both their market index data and the DJI such that both can be plotted over time. Let's look at before and after and see what it does.

Before melt:

       Dates MarketIndex       DJI
1 2002-01-02 -1.0346886 -0.3251888
2 2002-01-03 -1.0265119 -0.2602872
3 2002-01-04 -1.0199864 -0.2027079
4 2002-01-07 -1.0230787 -0.2439139
5 2002-01-08 -0.9953525 -0.2744783
6 2002-01-09 -0.9873846 -0.3115893


After melt:

alt.comparison <- melt(comparison, id.vars='Dates')
names(alt.comparison) <- c('Dates', 'Index', 'Price')
       Dates       Index      Price
1 2002-01-02 MarketIndex -1.0346886
2 2002-01-03 MarketIndex -1.0265119
3 2002-01-04 MarketIndex -1.0199864
4 2002-01-07 MarketIndex -1.0230787
5 2002-01-08 MarketIndex -0.9953525
6 2002-01-09 MarketIndex -0.9873846


So what they want to do is plot the indices over time and put them both on the same graph. In the first data set, each row is a date with both index values on that date. The melt function says to identify the other two columns by date. Each column name in the first data set becomes a grouping factor in the second data set. The values that are in columns 2 and 3 of the first data set are now all in column three in the second data set. It's hard to explain in words but makes sense if you stare at the before-and-after long enough.

1 thought on “Machine Learning for Hackers, Chapter 8

  1. Pingback: » Machine Learning for Hackers, Chapter 12 Statistics you can Probably Trust

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.