Machine Learning for Hackers, Chapter 5

This chapter covers linear regression. The case study is predicting page views for web sites given unique visitors, whether the site advertises, and whether the site is in English. The funny thing is that while they spend a lot of time explaining how to do linear regression in R, they never get around to explaining how to make a prediction using your regression model. (The name of the chapter is actually called “Regression: Predicting Page Views”.) I’ll explain how to do prediction in this post. But first I want to say the authors do a fine job of explaining the “big idea” of linear regression. It’s not an easy topic to cover in under 30 pages. If you’re new to the subject this is a great introduction. I highly recommend it. (And we all know my recommendation means a great deal.)

So let’s say you follow along with the book and a create a linear model to predict page views given unique visitors:

lm.fit <- lm(log(PageViews) ~ log(UniqueVisitors), data = top.1000.sites)

 

Notice the log transformation. This is necessary since the data is not normally distributed. The results of the regression are as follows:

Coefficients:
                   Estimate Std. Error t value   Pr(>|t|) 
(Intercept)        -2.83441    0.75201  -3.769   0.000173 ***
log(UniqueVisitors) 1.33628    0.04568  29.251    < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.084 on 998 degrees of freedom
Multiple R-squared: 0.4616, Adjusted R-squared: 0.4611 
F-statistic: 855.6 on 1 and 998 DF, p-value: < 2.2e-16

 

If we want to predict Page Views for a site with, say, 500,000,000 Unique Visitors, we use the coefficient estimates to calculate the following:

-2.83441 + 1.33628*log(500,000,000)

To easily do this in R we can submit code like this:

new <- data.frame(UniqueVisitors = 500000000)
exp(predict(lm.fit, new, interval = "prediction"))

 

This gives us not only a prediction (fit) but a prediction interval as well (lwr and upr):

fit           lwr        upr
1 24732846396 2876053856 2.12692e+11

 

Also notice we had to wrap the predict function in the exp function. That's because we ran the regression with the data transformed in the log scale. Once we make our prediction we need to transform it back to the original scale. I'm surprised the authors don't cover this. They do mention the predict function as a way to see what the regression model predicts for the original data values. But they don't talk about how to use your regression results to make a prediction for a new value that wasn't in your data. Perhaps that will be covered in future chapters.

Another minor complaint I have is a statement they make on page 150 in the 2nd paragraph: "...the Residual standard error is simply the RMSE of the model that we could compute using sqrt(mean(residuals(lm.git) ^ 2))" where lm.fit is the model object created when regressing Page Views on Unique Visitors. That's incorrect. The residual standard error is computed for this model with sqrt((sum(residuals(lm.fit)^2))/998), where 998 = 1000 - 2. In other words, instead of dividing the sum of the squared residuals by 1000 (i.e., taking the mean), we divide by the number of observations minus the number of coefficients in our model. In this case we have 1000 observations and two coefficients in our model (the intercept and slope). So we divide by 998.

While I didn't learn anything new about linear regression in this chapter, I did learn something about plotting lm model objects in R. I've known for a while you can easily create diagnostic plots by submitting the plot function with a model object as the argument. This creates 4 plots in a 2 x 2 grid. What I learned in this chapter is that you can specify which plot is produced by issuing an additional argument "which = ". For example, to just see a plot of residuals versus fitted values, you submit

plot(lm.fit, which = 1)

 

The book says issuing this statement will "plot the residuals against the truth." I think "truth" should say "fitted values". I'm not sure I've ever heard "fitted values" referred to as the "truth".

Anyway, as I said, I think this chapter is a solid introduction to linear regression. That's a good thing because it appears the next chapter steps it up a notch.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.