Monthly Archives: October 2012

Handy R Function: expand.grid

In graduate school I took an experimental design class. As expected, we were assigned a group project that required us to design and analyze our own experiment. Two other students and I got together and decided to design an experiment involving paper airplanes. Not terribly useful, but it’s not as if we had a grant from the NSF. Anyway, we wanted to see what caused a plane to fly farther. Was it design? Was it paper weight? Was it both? We ended up running a 3 x 3 factorial experiment with the paper airplane throwers (us) as blocks.

We had three different paper airplane designs, three different paper weights, three people throwing, and replications. If memory serves, there were three replications. In other words, each of us threw each of the 9 airplanes (3 designs x 3 paper weights) three times. That came out to 81 throws.
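
The throw count can be checked with a few lines of R. This is only a sketch of the arithmetic; the variable names are made up for illustration.

```r
# Counts taken from the description above
n_designs  <- 3   # airplane designs
n_papers   <- 3   # paper weights
n_throwers <- 3   # people throwing (the blocks)
n_reps     <- 3   # replications, if memory serves

n_planes     <- n_designs * n_papers    # 9 distinct airplanes
per_person   <- n_planes * n_reps       # 27 throws per person
total_throws <- per_person * n_throwers # 81 throws overall
total_throws
```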

Now we had to randomly pick an airplane and throw it. How to randomize? Here’s a handy way to do that in R. First we use the expand.grid function. This creates a data frame from all combinations of the supplied factors:

d <- expand.grid(throw=c(1,2,3), design=c(1,2,3), paper=c(20,22,28), person=c("Jack","Jane","Jill"))
head(d)
  throw design paper person
1     1      1    20   Jack
2     2      1    20   Jack
3     3      1    20   Jack
4     1      2    20   Jack
5     2      2    20   Jack
6     3      2    20   Jack

We see Jack will throw design 1 made of 20 lb. paper three times. Then he'll throw design 2 made of 20 lb. paper three times. And so on. Now to randomly sort the data frame:

d_rand <- d[sample(nrow(d)),]
head(d_rand)
   throw design paper person
34     1      3    20   Jane
20     2      1    28   Jack
3      3      1    20   Jack
53     2      3    28   Jane
37     1      1    22   Jane
77     2      2    28   Jill
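
Because sample draws a new random order on every run, the rows shown above will differ each time. If the throwing order needs to be reproducible, one option is to set a seed first. The sketch below rebuilds the grid so it runs on its own; the seed value is arbitrary and chosen only for illustration.

```r
# Rebuild the design grid so this sketch is self-contained
d <- expand.grid(throw = 1:3,
                 design = 1:3,
                 paper = c(20, 22, 28),
                 person = c("Jack", "Jane", "Jill"))

set.seed(2012)                  # arbitrary seed; any fixed integer works
d_rand <- d[sample(nrow(d)), ]  # shuffle all 81 rows

# Sanity checks: the shuffle keeps every planned throw exactly once
nrow(d_rand) == nrow(d)
identical(sort(as.integer(rownames(d_rand))), seq_len(nrow(d)))
```

Both checks should return TRUE, since sample(nrow(d)) is just a permutation of the row indices.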

Very handy! And nifty, too.
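
A variation, not the approach used above: since the throwers were the blocks, the order could also be shuffled separately within each person, so that each thrower gets their own random sequence of 27 throws. This is a minimal sketch under that assumption; the names by_person, shuffled, and d_block are hypothetical.

```r
# Rebuild the design grid so this sketch is self-contained
d <- expand.grid(throw = 1:3,
                 design = 1:3,
                 paper = c(20, 22, 28),
                 person = c("Jack", "Jane", "Jill"))

# Split into one data frame per thrower (27 rows each)
by_person <- split(d, d$person)

# Shuffle the rows inside each thrower's block independently
shuffled <- lapply(by_person, function(block) block[sample(nrow(block)), ])

# Stack the blocks back together: Jack's throws, then Jane's, then Jill's
d_block <- do.call(rbind, shuffled)
head(d_block)
```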


Coursera Review: Computing for Data Analysis

I’m a huge fan of online classes. I like the self-paced nature, the flexibility, and the autonomy. With the right instructor, the right materials, and the right kind of student, I believe an online course can be far superior to the traditional in-person class. So with that attitude I signed up for Computing for Data Analysis on Coursera.

If you’re not familiar with Coursera, it’s a site that offers free online classes taught by big-time college professors. I’m talking MIT, Princeton, Stanford, etc. As far as I can tell, the classes being taught are actually designed for Coursera. In other words, this isn’t just videos of classes taught back in 2009. My class was taught by Roger Peng from Johns Hopkins.

Computing for Data Analysis was described as follows: “In this course you will learn how to program in R and how to use R for effective data analysis”. Sold. My background is in statistics, not programming. I’ve always known just enough about R to sometimes (eventually, frustratingly) get what I need. This seemed like the kind of class I needed to build a strong foundation in R and fill in the gaps in my knowledge.

The course ran for four weeks. Each week a set of lecture videos was released with accompanying PDF slides. There was also a 10-question quiz each week that you could try 3 times. In addition there were two programming assignments, one due at the end of week 2 and the other at the end of the course. No materials were required. You just needed to download and install R, which is free. According to Dr. Peng, over 40,000 people signed up for the class.

So how did it go? Pretty good! On a scale of 1 to 10 I would give it an 8. The lectures were perfectly accessible. If you knew nothing about R or statistics or programming and were determined, you could easily follow along. The quizzes were fair and useful. The programming assignments were tough but doable. If you had questions, there was a lively discussion forum for help. I didn’t use it myself but I lurked to see what kinds of questions people had. It appeared that people who needed assistance were getting it. For the price (free!), this was an awesome class. It did what I hoped it would: fill in some gaps and show me better ways of doing things in R.

Minor points that kept me from giving the class a 10:

  • The videos would frequently hang and stop if I tried to rewind a portion. Fortunately you could download them and watch them locally. But if you did that you lost the interactive mini-quizzes to test your comprehension.
  • The programming assignments centered around writing R functions, which is useful I guess, but seemed to detract from the actual data analysis. I felt like I spent too much time debugging functions and not actually doing data analysis.
  • Some of the lectures were a little too esoteric in my opinion. One particularly long one covered the distinction between S3 and S4 classes. No doubt an important topic for someone planning for a career in R programming, but probably over the head of most people in this introductory class.

Major points that I loved about the class:

  • The wonderfully lucid explanation of the *apply functions. That alone made the class worth it for me.
  • The fantastic introduction to regular expressions. This is something I’ve been wanting for a long time: a clear gentle intro to regular expressions.
  • The extensive overview of debugging techniques. I have quite a few R books, including R in a Nutshell, R Cookbook, and Statistical Analyses Using R. None of them seriously touch on debugging.

For posterity, I combined all the course slides into one PDF file: Computing for Data Analysis course slides

Download it and browse through it. If you’re interested in doing data analysis with R, this is a must-have.