Playing with R Markdown

I was playing with R Markdown in RStudio and thought I’d share my results. I can’t believe how easy this is to use! In RStudio, just go File…New…R Markdown. This opens a new template with some helpful code ready to use. This is your Rmd file. It’s basically a script file, except not only can you submit R code from this file, you can also save the output and graphs into one fancy HTML page. This is so much easier than copying and pasting R code into a WordPress post and saving, uploading, and linking to images created in R. For example, see this post I wrote a while back. It took a long time to make.

Anyway, when you’re done, you click the Knit HTML button and your Rmd file is “knitted” into an HTML file. There’s a little bit of extra markup you need to add to ensure proper formatting, but it’s super easy. Check out this page to see how it works. You just surround chunks of your R code with some simple markup.

So here’s what I did. First I worked through the introductory session of Modern Applied Statistics with S (MASS) by Venables and Ripley. Here is my Rmd file and final output. Next I worked problem 4 from chapter 3 of Data Analysis Using Regression and Multilevel/Hierarchical Models by Gelman and Hill. Here’s the Rmd file and final output. The final output links are the cool parts. Those are just HTML files with embedded images. I uploaded the Rmd files so you can see the marked-up R code. As you’ll see, there’s not much there. If you want your R code and associated output and graphs to be nicely formatted in HTML, just surround it with a chunk like this:

```{r}
...
```

That’s it. You can also make bold headings by wrapping text in double asterisks.
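
To give a feel for it, here’s a rough sketch of what a tiny Rmd file might look like (the chunk contents are just placeholders, not anything from my files above):

````
**Playing with the cars data**

Some plain text describing the chunk below.

```{r}
# this chunk is run when you knit, and its output and plot
# show up in the HTML
summary(cars)
plot(cars)
```
````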

A Logistic Regression Checklist

I recently read The Checklist Manifesto by Atul Gawande and was fascinated by how relatively simple checklists can improve performance and results in such complex endeavors as surgery or flying a commercial airplane. I decided I wanted to make a checklist of my own for logistic regression. It ended up not being a checklist on how to do it per se, but rather a list of important facts to remember. Here’s what I came up with.

  • Logistic regression models the probability that y = 1: P(y_{i} = 1) = logit^{-1}(X_{i}\beta), where logit^{-1}(x) = \frac{e^{x}}{1+e^{x}}
  • Logistic regression predictions are probabilistic: the model predicts a probability that y = 1. It does not make a point prediction.
  • The function logit^{-1}(x) = \frac{e^{x}}{1+e^{x}} transforms continuous values to the range (0,1).
  • Dividing a regression coefficient by 4 gives an upper bound on the predictive difference corresponding to a unit difference in x. For example, if \beta = 0.33, then 0.33/4 \approx 0.08. This means a unit increase in x corresponds to no more than an 8% positive difference in the probability that y = 1.
  • The odds of success (i.e., y = 1) increase multiplicatively by e^{\beta} for every one-unit increase in x. That is, exponentiated logistic regression coefficients can be interpreted as odds ratios. For example, let’s say we have a regression coefficient of 0.497. Exponentiating gives e^{0.497} = 1.64. That means the odds of success increase by 64% for each one-unit increase in x. Recall that odds = \frac{p}{1-p}. If our predicted probability at x is 0.674, then the odds of success are \frac{0.674}{0.326} = 2.07. Therefore at x + 1, the odds increase by 64%, from 2.07 to 2.07(1.64) = 3.40. Notice that 1.64 = \frac{3.40}{2.07}, which is an odds ratio. The ratio of the odds at x + 1 to the odds at x will always be e^{\beta}, where \beta is a logistic regression coefficient. (This arithmetic is worked out in the first R sketch after this list.)
  • Plots of raw residuals from logistic regression are generally not useful. Instead it’s preferable to plot binned residuals “by dividing the data into categories (bins) based on their fitted values, and then plotting the average residual versus the average fitted value for each bin.” (Gelman & Hill, p. 97). Example R code for doing this can be found here; there’s also a sketch after this list.
  • The error rate is the proportion of cases for which your model’s point prediction disagrees with the observed outcome, where we predict y = 1 when the predicted probability exceeds 0.5 and y = 0 otherwise. It’s not good if your error rate equals the null rate. The null rate is the error rate you’d get by simply guessing the more common outcome for every case; it’s usually the proportion of 0’s in your data. In other words, if you guessed that all cases in your data are y = 1, the null rate is the percentage you guessed wrong. Let’s say your data are 58% y = 1 and 42% y = 0; then the null rate is 42%. Further, let’s say you do some logistic regression on these data and your model has an error rate of 36%. That is, 36% of the time it predicts the wrong outcome. This means your model does only 6 percentage points better than simply guessing that all cases are y = 1. (See the error-rate sketch after this list.)
  • Deviance is a measure of error; lower deviance is better. Adding a predictor that is pure noise decreases deviance by about 1 on average, so when an informative predictor is added to a model we expect deviance to decrease by more than 1. If it doesn’t, we’ve likely added a non-informative predictor that just adds noise.
  • If a predictor x is completely aligned with the outcome, so that y = 1 whenever x is above some threshold and y = 0 whenever x is below it, then the coefficient estimate will explode to some gigantic value. This means the parameter cannot be estimated. This is an identifiability problem called separation. (A toy example is sketched below.)
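
To tie the first few bullets together, here’s a small R sketch using the same numbers as above. The only thing I’m defining myself is the invlogit function; base R’s plogis() does the same thing.

```{r}
# inverse logit by hand; plogis() in base R is equivalent
invlogit <- function(x) exp(x) / (1 + exp(x))

b <- 0.497                 # example coefficient from the list above
b / 4                      # divide-by-4 rule: at most ~0.12 change in P(y = 1) per unit of x
exp(b)                     # odds ratio, about 1.64

p <- 0.674                 # predicted probability at some value of x
odds_x  <- p / (1 - p)     # odds at x, about 2.07
odds_x1 <- odds_x * exp(b) # odds at x + 1, about 3.40
odds_x1 / odds_x           # the ratio recovers exp(b)
```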
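
For the binned residual plot, the arm package (the companion package to Gelman and Hill’s book) has a binnedplot() function. This is just a sketch; fit stands for a logistic regression object you’ve already fit with glm().

```{r}
library(arm)  # install.packages("arm") if needed

# fit is assumed to be a glm(..., family = binomial) object
# x-axis: fitted probabilities; y-axis: average residual within each bin
binnedplot(fitted(fit), resid(fit, type = "response"))
```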
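
And a sketch of the error rate, null rate, and deviance checks. Here y is the observed 0/1 outcome and x1 and x2 are hypothetical predictors; again, fit is a model you’ve already fit.

```{r}
# fit is assumed to be a glm(..., family = binomial) object; y is the observed 0/1 outcome
pred <- ifelse(fitted(fit) > 0.5, 1, 0)  # point predictions at the 0.5 cutoff
error_rate <- mean(pred != y)            # proportion of wrong predictions
null_rate  <- min(mean(y), 1 - mean(y))  # error rate from always guessing the more common outcome
c(error_rate, null_rate)

# deviance should drop by clearly more than 1 when the added predictor is informative
fit1 <- glm(y ~ x1,      family = binomial)
fit2 <- glm(y ~ x1 + x2, family = binomial)
deviance(fit1) - deviance(fit2)
```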
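
Finally, a toy illustration of separation, with made-up data where y is 1 exactly when x is above 5:

```{r}
x <- 1:10
y <- as.numeric(x > 5)           # outcome perfectly aligned with x
fit_sep <- glm(y ~ x, family = binomial)
# glm() warns that fitted probabilities of 0 or 1 occurred;
# the estimate for x is huge and its standard error is even bigger
coef(summary(fit_sep))
```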

Most of this comes from chapter 5 of Data Analysis Using Regression and Multilevel/Hierarchical Models by Gelman and Hill. I also pulled a little from chapter 5 of An Introduction to Categorical Data Analysis by Agresti.