
Simple Linear Regression in R

I want to document in one place some of the ways I carry out simple linear regression in R. To illustrate I’ll use a problem from  the textbook Probability and Statistical Inference, 7th edition by Hogg and Tanis. The problem gives the times of 22 swimmers in the 50 yard freestyle at a hypothetical championship meet. For each swimmer there are two times: (1) their best time for the season, and (2) their time in the championship meet. We wish to regress the time in the meet on the best time of the season to see if there is a linear relationship between the two.

First we read in the data:

swim <- read.table("C:/My Documents/hogg_tanis_data/Exercise_6_12-10.txt",header=T)
swim <- swim[order(swim$x),]

The read.table function seems to work without fail for me. As long as the source data is in columns and there is clear separation between the columns, it works like a charm. If the first row has column headers, then set "header=TRUE". The second line of code simply orders the data by x, which is the best times of the season. Now we're ready to carry out the regression:

swim.mod <- lm(y~x,data=swim)
plot(swim$x,swim$y)
abline(swim.mod)

The first line carries out the regression. The second line creates a scatter plot of the pairs of times for the 22 swimmers. The third line adds the regression line to the plot. Now usually I would do the scatter plot first. That alone may tell you whether or not simple linear regression is appropriate. Of course in this case we know it's appropriate, because well, that's the chapter this problem comes from. Here's what we get:

Now let's examine the residuals to check our assumption of constant variance:

plot(swim$x,swim.mod$residuals)
abline(0,0)

Not too bad, I guess, though the variance seems to taper off at the extremes. This matches what we see in the scatterplot with the regression line, where the points at the extremes lie closer to the line than the points in the middle. In other words, the fastest swimmers during the season tend to post the fastest times at the meet, and likewise the slowest swimmers tend to stay the slowest. In the middle, however, we see a lot of variation.

Now let's take a look at the formula for that regression line:

summary(swim.mod)

gives us the following output:

We see the estimated regression line is y = 7.18 + 0.67x and that both the intercept and slope are significantly different from 0. That's what the low p-values tell us. The intercept has no interpretation here since x is never 0. The slope says that two swimmers who differed by 1 second in "best time during season" will differ by 0.67 seconds in "time in the championship meet". Again these are estimates, so it's not worth getting too attached to the precision of the estimates in the output. That's why I reported the slope as 0.67 instead of 0.6705. How precise are these estimates? Let's look at the 95% confidence intervals for the parameters:
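One way to get those intervals in R is with the confint function applied to the fitted model; a minimal sketch:

# 95% confidence intervals for the intercept and slope
confint(swim.mod, level=0.95)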

The lower bound of the slope is about 0.56 and the upper bound is about 0.88. We're reasonably confident the true slope is between those two bounds: the process we used captures the true slope 95% of the time, and we'd like to think it worked this time, so that the true slope really is in the interval (0.56, 0.88).

Finally let's add 90% confidence bands to our scatterplot and regression line. This will give us a general idea of how good our regression line is for predicting a mean "meet time" value for a given "best time in season" value:

conf.band <- predict(swim.mod, interval="confidence", level=0.90)
matplot(swim$x,conf.band, type="l", lty=c(1,2,2), ylab="predicted y")

The first line of code computes the predicted means along with their confidence limits. The second line creates the plot. The matplot function plots one vector or matrix against the columns of another matrix; in this case it plots the x values versus (1) the fitted values, (2) the lower limits, and (3) the upper limits. The result looks like this:

You can see the band pinches in the middle, where we have more data and thus more confidence about our predicted means. It also appears that, on average, times improve in the final meet. For example, for swimmers whose best time during the season was 23 seconds, the average time at the meet looks to be about 22.5 seconds (give or take a couple of tenths of a second).
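If you'd rather read that value off numerically than eyeball the plot, predict will give the interval at a specific x; a minimal sketch (the exact numbers depend on the data):

# 90% confidence interval for the mean meet time of swimmers
# whose best season time was 23 seconds
predict(swim.mod, newdata=data.frame(x=23), interval="confidence", level=0.90)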

Transforming Data and the Box-Cox Transformation

When doing linear modelling, we make certain assumptions about the data. One such assumption is uniform variance. The fancy word for that is homoscedasticity. It means the variance around a regression line is roughly the same for all values of the predictor variable. One way to check that assumption is to fit a linear model and produce a scatterplot of residuals versus the predictor variable. If the data are homoscedastic then we should see no pattern in the scatter plot. They should fall in a uniform random pattern around 0. If there is a pattern, then our assumption of homoscedasticity is wrong and that’s a bad thing. Fortunately there is something we can do about it: transform the data. If you’re uncomfortable with the idea of transforming data, then just think of transforming inches to centimeters or Fahrenheit to Celsius. That’s the same thing. We apply a formula (or more technically, a function) to each data point and transform the data to a new scale. Then we re-do our regression analysis with the transformed data and hopefully see a nice uniform scatter of residuals versus the predictor.

But how do we transform the data? There are plenty of options. We can take the natural log of the response. Or we can take the square root or inverse square root. Or we can square it or cube it. We can even take the arcsine. So which option do you pick? You can use trial and error. Pick a handful of transformations and see which works best. Or just do a log transformation. That’s a pretty safe bet to at least improve the residual variance. Yes, but this is math and statistics and we want to be efficient! We want a systematic way to pick the “best” transformation. Well, I can’t say that I blame you. So here’s one way to systematically find the “best” transformation (as long as all the response data are positive).

First, we use the standardized Box-Cox Transformation:

t_{\lambda}(Y_{i})=\frac{Y_{i}^{\lambda}-1}{\lambda \dot{Y}^{\lambda - 1}}

where \dot{Y} is the geometric mean of the response variable. That is \dot{Y}=(\prod_{i=1}^{n}Y_{i})^{1/n}. So what we do is find the geometric mean of our untransformed response data, then use that in the standardized Box-Cox formula above with a specific lambda value, usually ranging from -2 to 2. That gives transformed response data which we then regress against our predictor. We make note of our Sum of Squares Error and then start over with a new value of lambda. The lambda that leads to the lowest Sum of Squares Error is declared our “best” transformation. For example, if lambda=0.5 gave us the lowest SSE, then we would transform our data using a square-root transformation.
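One thing worth noting: the formula above is undefined at \lambda = 0, but taking the limit as \lambda \rightarrow 0 gives

t_{0}(Y_{i}) = \dot{Y} \ln Y_{i}

which is the formula we'll use for the \lambda = 0 case in the code below.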

Let’s do a real-life example (or at least something that sounds like real-life). In a medical study, researchers collected measurements from 25 young children. Of interest is the dependency of the “plasma level of a polyamine,” (the response) on age (the regressor or predictor).

Here’s our data:

Using R we fit a linear model using the original untransformed data. After that we plot the residuals versus age to check our assumption of homoscedasticity:
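In R that looks something like the following sketch (model1 is just the label I'm using for the untransformed fit; the data are assumed to be in the vectors age and level):

# fit the model on the untransformed response and plot residuals vs. age
model1 <- lm(level~age)
plot(age, model1$residuals)
abline(0,0)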

Look at age 0. Not good.  Way too much spread. We can address that problem with a data transformation. Let’s find a good transformation using the method described above. First we find the geometric mean (Note: “level” = Plasma level):

# calculate geometric mean of response
gmean <- exp(mean(log(level)))

Notice we take the arithmetic mean of log transformed plasma levels, and then take the exponential of that. As my professor in grad school told me in his notes, "The advantage of this approach is to avoid a build-up of numerical error."

Now that we have that, we can write a little program in R to loop through values of lambda, transforming our data, fitting a model, and saving our Sum of Squares Error value:

# loop over candidate lambda values, fit the model on the
# transformed response, and record the SSE for each lambda
sse <- c()
lambda <- c()
i <- 1
for (lam in seq(-1,1,0.1)){
  if (lam != 0){
    tY <- (level^lam - 1) / (lam*gmean^(lam-1))
  } else {
    # limiting case of the transformation at lambda = 0
    tY <- log(level)*gmean
  }
  test <- anova(lm(tY~age))
  sse[i] <- test['Residuals','Sum Sq']
  lambda[i] <- lam
  i <- i+1
}

(Notice that when lambda = 0 we have to use a different formula: log (response)*geometric mean.) After we run the program we check the results:
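A quick way to look over the lambda and SSE pairs and pull out the minimizer is something like:

# view SSE by lambda and identify the lambda with the smallest SSE
cbind(lambda, sse)
lambda[which.min(sse)]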

Where do we see the lowest SSE value? It's 30.55961, which happens at a lambda value of -0.5. That tells us the optimal transformation is the inverse square-root transformation. So let's transform our response variable accordingly, re-fit our model and check our plot of residuals versus age:

model2 <- lm(level^(-0.5)~age)
plot(age,model2$residuals)

Now that looks a lot better, especially at age 0. But now that we have a model that was fit with our response data transformed, how does that affect our prediction? It means we need to transform back. The fitted model for the plasma data after transformation is plasma = 0.26803 + 0.04006*age. If we predict the plasma level for age 2.5 we get plasma = 0.36818. Compare that to our data above and you can see that doesn't make any sense. We need to apply the inverse of the inverse square-root to get it back to the original scale: 0.36818^(1/-0.5) = 7.376997. There we go.
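In R, that prediction and back-transformation can be done directly from the fitted model; a minimal sketch (pred.t is just a scratch variable):

# predicted plasma level on the transformed (inverse square-root) scale
pred.t <- predict(model2, newdata=data.frame(age=2.5))
# back-transform to the original scale: raise to the power 1/(-0.5) = -2
pred.t^(-2)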

Of course, use some common sense when applying this method. If the SSE values are pretty close together for lambda values of, say, 0.4 and 0.5, it may make more practical sense to pick 0.5 (i.e., the square-root transformation). Again I quote my grad school professor: "Scientific explanations or even standard conventions often carry more meaning than strict minimization of SSE."

And by the way I should disclose the example above came from a course in Applied Linear Models I took at the University of Virginia. The only difference is the class used SAS. I decided to re-do the example in R.

Logistic Regression

Let’s pretend I take a random sample of 1200 American male adults and I ask them two questions:

  1. What’s your annual salary?
  2. Do you watch NFL football?

Further, let’s pretend they all answer, and they all answer honestly! Now I want to know if a person’s annual salary is associated with whether or not they watch NFL football. In other words, does the answer to the first question above help me predict the answer to the second question? In statistical language, I want to know if the continuous variable (salary) is associated with the binary variable (NFL? yes or no). Logistic regression helps us answer this question.

Before I go on I must say that whole books are devoted to logistic regression. I don’t intend to explain all of logistic regression in this one post. But I do hope to illustrate it with a simple example and hopefully provide some insight to what it does.

In classic linear regression, the response is continuous. For example, does a person’s weight help me predict diastolic blood pressure? The response is diastolic blood pressure and it’s a continuous variable. A good way to picture linear regression is to think of a scatter plot of two variables and imagine a straight line going through the dots that best represents the association between the two variables. Like this:

In my NFL/salary example, the response is “yes” or “no”. We can translate that to numbers as “yes”=1 and “no”=0.  Now we have a binary response. It can only take two values: 1 or 0. If we do a scatter plot of the two variables salary and NFL, we get something like this:

Trying to draw a line through that to describe association between salary and 0/1 seems pretty dumb. Instead what we do is try to predict the probability of taking a 0 or 1. And that’s the basic idea of logistic regression: develop a model for predicting the probability of a binary response. So we do eventually have “a line” we draw through the plot above, but it’s a smooth curve in the shape of an “S” that depicts increasing (or decreasing) probability of taking a 0 or 1 value.

The logistic model is as follows:

Pr(y=1) = logit^{-1}(\alpha + x \beta)

where logit^{-1}(x) = \frac{e^{x}}{1+e^{x}}

So we use statistical software to find the \alpha and \beta values (i.e., the coefficients), multiply \beta by our predictor (in our example, salary), add \alpha, and plug the result into the inverse logit function to get a predicted probability of taking a 0 or 1. Let's do that.

I generated some data in R as follows:

#simulate salaries between $23,000 and $100,000
salary <- round(runif(1200,23,100))

#simulate yes/no response to watching NFL
nfl <- c()
for (i in 1:1200){
    if (salary[i] < 61) {nfl[i] <- rbinom(1,1,0.75)}
    else {nfl[i] <- rbinom(1,1,0.25)}
}

The salaries are random in the range between 23 and 100. Obviously they represent thousands of dollars. The NFL response is not totally random. I rigged it so that a person making more money is less likely to watch the NFL. I have no idea if that's true or not, nor do I care. I just wanted to generate data with some association. Using R we can carry out the logistic regression as follows:

fit <- glm(nfl ~ salary, family=binomial(link="logit"))

We get the following results:

\alpha = 2.88 and \beta= -0.05.

Trust me when I say that those coefficients are statistically significant. Our model is Pr(NFL=1) = logit^{-1}(2.88 - 0.05 \times salary).

Now we can use them to predict probabilities. What's the probability someone earning $40,000/year watches the NFL?

\frac{e^{2.88 -0.05(40)}}{1 + e^{2.88 -0.05(40)}} = 0.73

What's the probability someone earning $80,000/year watches the NFL?

\frac{e^{2.88 -0.05(80)}}{1 + e^{2.88 -0.05(80)}} = 0.30
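Rather than plugging into the formula by hand, R can compute these predicted probabilities straight from the fitted model; a minimal sketch (the exact values depend on the unrounded coefficients):

# predicted probability of watching the NFL at salaries of $40k and $80k
predict(fit, newdata=data.frame(salary=c(40,80)), type="response")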

I'm leaving out tons of math and theory here. As I said at the outset, there's much more to logistic regression. But what I just demonstrated is the basic idea. With our model we can now draw a line through our plot:
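One way to draw that curve in R is with plogis, the built-in inverse logit; a sketch:

# scatter plot of the data with the fitted probability curve overlaid
plot(salary, nfl)
curve(plogis(coef(fit)[1] + coef(fit)[2]*x), add=TRUE)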

The line represents probability. It decreases as salary increases. People with higher salaries are less likely to say they watch the NFL. At least according to the world of my made up data.