
rms for the rest of us

For a few years now I’ve been chipping away at Regression Modeling Strategies by Frank Harrell. Between the exposition, the case studies, the notes, the references, the code, and the exercises, it’s a deep and rich source of statistical learning. The book’s associated R package, {rms}, is used throughout and is one of the reasons the book can be tough sledding at times, especially if you’re like me and learned regression using the base R functions lm() and glm(). It’s not that {rms} is hard to use. In fact it’s quite easy to use. But calling functions such as summary() and anova() on {rms} models produces very different output from what you get with a base R lm() model, and the results can seem baffling to the uninitiated. In this post I hope to explain a little of what {rms} is doing by comparing it to the more traditional approaches in R so commonly taught in classrooms and textbooks.

Model fitting and summary output

To begin, let’s load the gala data from the {faraway} package. The gala data set contains data on species diversity on the Galapagos Islands. Below we use the base R lm() function to model the number of plant Species found on an island as a function of the Area of the island (km\(^2\)), the highest Elevation of the island (m), the distance from the Nearest island (km), the distance from Santa Cruz island (km), and the area of the Adjacent island (km\(^2\)). This is how I and thousands of others learned to do regression in R. Of course we use the familiar summary() function on our model object to see the model coefficients, marginal tests, the residual standard error, R-squared, and so on.

library(faraway)
data("gala")
m <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, 
        data = gala)
summary(m)
## 
## Call:
## lm(formula = Species ~ Area + Elevation + Nearest + Scruz + Adjacent, 
##     data = gala)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -111.679  -34.898   -7.862   33.460  182.584 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.068221  19.154198   0.369 0.715351    
## Area        -0.023938   0.022422  -1.068 0.296318    
## Elevation    0.319465   0.053663   5.953 3.82e-06 ***
## Nearest      0.009144   1.054136   0.009 0.993151    
## Scruz       -0.240524   0.215402  -1.117 0.275208    
## Adjacent    -0.074805   0.017700  -4.226 0.000297 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 60.98 on 24 degrees of freedom
## Multiple R-squared:  0.7658, Adjusted R-squared:  0.7171 
## F-statistic:  15.7 on 5 and 24 DF,  p-value: 6.838e-07

Now let’s fit the same model using the {rms} package. For this we use the ols() function. Notice we can use R’s formula syntax as usual. One additional argument that will come in handy in a few moments is x = TRUE, which stores the predictor variables with the model fit as a design matrix. Notice we don’t need to call summary() on the saved model object. We simply print it.

library(rms)
mr <- ols(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, 
            data = gala, x = TRUE)
mr
## Linear Regression Model
## 
## ols(formula = Species ~ Area + Elevation + Nearest + Scruz + 
##     Adjacent, data = gala, x = TRUE)
## 
##                  Model Likelihood    Discrimination    
##                        Ratio Test           Indexes    
## Obs       30    LR chi2     43.55    R2       0.766    
## sigma60.9752    d.f.            5    R2 adj   0.717    
## d.f.      24    Pr(> chi2) 0.0000    g      105.768    
## 
## Residuals
## 
##      Min       1Q   Median       3Q      Max 
## -111.679  -34.898   -7.862   33.460  182.584 
## 
## 
##           Coef    S.E.    t     Pr(>|t|)
## Intercept  7.0682 19.1542  0.37 0.7154  
## Area      -0.0239  0.0224 -1.07 0.2963  
## Elevation  0.3195  0.0537  5.95 <0.0001 
## Nearest    0.0091  1.0541  0.01 0.9932  
## Scruz     -0.2405  0.2154 -1.12 0.2752  
## Adjacent  -0.0748  0.0177 -4.23 0.0003

The coefficient tables and summaries of residuals are the same. Likewise both return R-squared and Adjusted R-squared. The residual standard error from the lm() output is called sigma in the ols() output.
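As a quick check that these really are the same quantity, base R’s sigma() extracts the residual standard error from the lm() fit, and the ols() fit stores its value in the stats component (I’m assuming the element is named "Sigma", which is what my version of {rms} uses):

sigma(m)           # residual standard error from lm(), about 60.98
mr$stats["Sigma"]  # the same value stored with the ols() fit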

The ols() output also includes a “g” statistic. This is Gini’s mean difference and it measures the dispersion of the predicted values. It’s an alternative to the standard deviation. We can use the {Hmisc} function GiniMd to calculate Gini’s mean difference of the predicted values from the lm() model as follows.

Hmisc::GiniMd(predict(m))
## [1] 105.7677
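For the curious, Gini’s mean difference is just the mean absolute difference over all pairs of values. Here is a minimal by-hand version, assuming the usual definition with denominator n(n - 1); it should reproduce the value above (about 105.77):

yhat <- predict(m)
n <- length(yhat)
# sum of |yhat_i - yhat_j| over all ordered pairs, divided by n(n - 1)
sum(abs(outer(yhat, yhat, "-"))) / (n * (n - 1))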

Whereas the summary output for lm() includes an F test of the null hypothesis that all of the predictor coefficients equal 0, the ols() output reports a Likelihood Ratio Test. This tests the same null hypothesis using a ratio of likelihoods. We can calculate this test for the lm() model using the logLik() function. First we subtract the log likelihood of a model with no predictors from the log likelihood of the original model, and then multiply by 2. Multiplying by 2 ensures the test statistic converges to a chi-square distribution if the null is true (this is Wilks’ theorem). If the null is true, the two likelihoods should be similar, so their ratio should be close to 1 and the difference of their logs (the log of the ratio, by the quotient rule of logs) should be close to 0. This statistic is very large. Clearly at least one of the predictor coefficients is not 0.

LRchi2 <- (logLik(m) - logLik(lm(gala$Species ~ 1)))*2
LRchi2
## 'log Lik.' 43.55341 (df=7)

If we want to calculate the p-value that appears in the ols() output, we can use the pchisq() function. (The df = 7 below is carried along from the logLik object, which counts all estimated parameters including the intercept and sigma; the LR test itself has 5 degrees of freedom, one per predictor, but either way the p-value is essentially 0.) The p-value in the ols() output is simply rounded.

pchisq(LRchi2, df = 7, lower.tail = FALSE)
## 'log Lik.' 2.60758e-07 (df=7)
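For comparison, the overall F test in the lm() summary (F = 15.7 on 5 and 24 DF) tests the same null hypothesis. We can pull it out of the summary object and recover its p-value with pf():

fstat <- summary(m)$fstatistic
fstat
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)
# about 6.8e-07, matching the lm() summary output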

Partial F tests and importance measures

Now let’s use the anova() function on both model objects and compare the output. For the lm() object, we get sequential partial F tests using Type I sums of squares. Each line tests the null hypothesis that adding the listed predictor to the previous model without it explains no additional variability. So the Area line compares a model with just an intercept to a model with an intercept and Area. The Elevation line compares a model with an intercept and Area to a model with an intercept, Area, and Elevation. And so on. Small p-values are evidence against the null.

anova(m)
## Analysis of Variance Table
## 
## Response: Species
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## Area       1 145470  145470 39.1262 1.826e-06 ***
## Elevation  1  65664   65664 17.6613 0.0003155 ***
## Nearest    1     29      29  0.0079 0.9300674    
## Scruz      1  14280   14280  3.8408 0.0617324 .  
## Adjacent   1  66406   66406 17.8609 0.0002971 ***
## Residuals 24  89231    3718                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
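To see where these sequential sums of squares come from, we can fit the nested models ourselves and difference their residual sums of squares (deviance() returns the RSS for an lm fit). A quick sanity check on the first two lines of the table:

fit0 <- lm(Species ~ 1, data = gala)
fit1 <- update(fit0, . ~ . + Area)
fit2 <- update(fit1, . ~ . + Elevation)
deviance(fit0) - deviance(fit1)  # Sum Sq for Area, about 145470
deviance(fit1) - deviance(fit2)  # Sum Sq for Elevation, about 65664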

Calling anova() on the ols() object produces a series of partial F tests using Type II sums of squares. Each line tests the null hypothesis that adding the listed predictor to a model with all the other listed predictors already in it explains no additional variability. So the Area line compares a model with all the other predictors to a model with all the other predictors and Area. The Elevation line compares a model with all the other predictors to a model with all the other predictors and Elevation. The line labeled REGRESSION is the F Test reported in the summary output for the lm() model.

anova(mr) 
##                 Analysis of Variance          Response: Species 
## 
##  Factor     d.f. Partial SS   MS           F     P     
##  Area        1   4.237718e+03 4.237718e+03  1.14 0.2963
##  Elevation   1   1.317666e+05 1.317666e+05 35.44 <.0001
##  Nearest     1   2.797576e-01 2.797576e-01  0.00 0.9932
##  Scruz       1   4.635787e+03 4.635787e+03  1.25 0.2752
##  Adjacent    1   6.640639e+04 6.640639e+04 17.86 0.0003
##  REGRESSION  5   2.918500e+05 5.837000e+04 15.70 <.0001
##  ERROR      24   8.923137e+04 3.717974e+03

To run these same Partial F tests for the lm() object we can use the base R drop1() function.

drop1(m, test = "F")
## Single term deletions
## 
## Model:
## Species ~ Area + Elevation + Nearest + Scruz + Adjacent
##           Df Sum of Sq    RSS    AIC F value    Pr(>F)    
## <none>                  89231 251.93                      
## Area       1      4238  93469 251.33  1.1398 0.2963180    
## Elevation  1    131767 220998 277.14 35.4404 3.823e-06 ***
## Nearest    1         0  89232 249.93  0.0001 0.9931506    
## Scruz      1      4636  93867 251.45  1.2469 0.2752082    
## Adjacent   1     66406 155638 266.62 17.8609 0.0002971 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The {rms} package offers a convenient plot method for anova() objects that “draws dot charts depicting the importance of variables in the model” (from the anova.rms help page). The default importance measure is the chi-square statistic for each factor minus its degrees of freedom. There are several other importance measures available. See the anova.rms help page for the what argument.

plot(anova(mr))

IQR effect plots

Earlier we noted that we don’t use the summary() function on ols() model objects to see a model summary. That doesn’t mean there isn’t a summary() method available. There is, but it does something entirely different from the summary() method for lm() objects. If we try it right now we get an error message:

summary(mr)
## Error in summary.rms(mr): adjustment values not defined here or with datadist for Area Elevation Nearest Scruz Adjacent

Notice what it says: “adjustment values not defined here or with datadist”. This tells us we need to define “adjustment values” before we can use the summary() function. Instead of summarizing the model, summary() called on an {rms} model object returns a “summary of the effects of each factor”. Let’s demonstrate what this means.

The easiest way to define adjustment values is to use the datadist() function on our data frame and then set the global datadist option using the base R options() function. Harrell frequently assigns datadist results to “d” so we do the same. (Also, the help page for datadist states, “The best method is probably to run datadist once before any models are fitted, storing the distribution summaries for all potential variables.” I elected to wait for presentation purposes.)

d <- datadist(gala)
options(datadist = "d")

If we print “d” we see adjustment levels for all variables in the model.

d
##                 Species Endemics      Area Elevation Nearest   Scruz  Adjacent
## Low:effect         13.0     7.25    0.2575     97.75   0.800  11.025    0.5200
## Adjust to          42.0    18.00    2.5900    192.00   3.050  46.650    2.5900
## High:effect        96.0    32.25   59.2375    435.25  10.025  81.075   59.2375
## Low:prediction      2.0     1.45    0.0390     47.35   0.445   0.490    0.1000
## High:prediction   319.1    85.40  782.6215   1229.40  40.205 193.905  782.6215
## Low                 2.0     0.00    0.0100     25.00   0.200   0.000    0.0300
## High              444.0    95.00 4669.3200   1707.00  47.400 290.200 4669.3200
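The “Low:effect” and “High:effect” rows are essentially the 25th and 75th percentiles of each variable, and “Adjust to” is the median. A quick check for Area (treat this as a sanity check; datadist may handle some variable types differently):

# should be close to the Low:effect, Adjust to, and High:effect values for Area
quantile(gala$Area, probs = c(0.25, 0.5, 0.75))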

Now let’s call summary() on our {rms} model object and see what it returns.

summary(mr)
##              Effects              Response : Species 
## 
##  Factor    Low     High    Diff.   Effect     S.E.    Lower 0.95 Upper 0.95
##  Area       0.2575  59.238  58.980  -1.411900  1.3225  -4.1413     1.3176  
##  Elevation 97.7500 435.250 337.500 107.820000 18.1110  70.4400   145.2000  
##  Nearest    0.8000  10.025   9.225   0.084353  9.7244 -19.9860    20.1550  
##  Scruz     11.0250  81.075  70.050 -16.849000 15.0890 -47.9910    14.2930  
##  Adjacent   0.5200  59.238  58.718  -4.392400  1.0393  -6.5374    -2.2473

This produces an interquartile range “Effects” summary for all predictors. For example, in the first row we see the effect of Area is about -1.41. To calculate this we predict Species when Area = 59.238 (High, the 75th percentile) and when Area = 0.2575 (Low, the 25th percentile) and take the difference in predicted Species. For each prediction, all other variables are held at their “Adjust to” level as shown above when we printed the datadist object. In addition to the effect, the standard error (SE) of the effect and a 95% confidence interval on the effect are returned. In this case we’re not sure if the effect is positive or negative.

To calculate this effect measure using our lm() model object we can use the predict() function to make two predictions and then take the difference using the diff() function. Notice all the values in the newdata argument come from the datadist object above. The two values for Area are the “Low:effect” and “High:effect” values. The other values come from the “Adjust to” row.

# How to get Area effect in summary(mr) output = -1.411900
p <- predict(m, newdata = data.frame(Area = c(0.2575, 59.238), 
                                Elevation=97.75, 
                                Nearest=3.05,
                                Scruz=46.65,
                                Adjacent=2.59))
p
##        1        2 
## 26.90343 25.49153
diff(p)
##         2 
## -1.411895

To get the standard error and confidence interval we can do the following. (Recall 58.980 is the IQR of Area.)

se <- sqrt((58.980 * vcov(m)["Area","Area"] * 58.980))
se
## [1] 1.32247
diff(p) + c(-1,1) * se * qt(0.975, df = 24)
## [1] -4.141340  1.317549

This is made a little easier using the emmeans() function in the {emmeans} package. We use the at argument to specify at what values we wish to make predictions for Area. All other predictors are held at their median by setting cov.reduce = median. Piping the result into the {emmeans} contrast() function with argument “revpairwise” subtracts the estimated means in reverse order. Finally we pipe into confint() to replicate the IQR effect estimate for Area that we saw in the ols() summary.

library(emmeans)
emmeans(m, specs = "Area", at = list(Area = c(0.2575, 59.238)), 
        cov.reduce = median) |>
  contrast("revpairwise") |>
  confint()
##  contrast                estimate   SE df lower.CL upper.CL
##  Area59.238 - Area0.2575    -1.41 1.32 24    -4.14     1.32
## 
## Confidence level used: 0.95

There is also a plot method for {rms} summary objects. It plots the Effects and the 90, 95, and 99 percent confidence intervals using different shades of blue. Below we see the Area effect of -1.41 with relatively tight confidence intervals hovering around 0.

plot(summary(mr))
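Those wider and narrower bands are nothing mysterious. For the lm() version of the Area effect we can reproduce them by swapping in different t quantiles, reusing the p and se objects computed above:

# 90% and 99% confidence intervals on the Area IQR effect
diff(p) + c(-1, 1) * se * qt(0.95, df = 24)
diff(p) + c(-1, 1) * se * qt(0.995, df = 24)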

Non-linear effects

Harrell advocates using non-linear effects in the form of regression splines. This makes a lot of sense when you pause and consider how many effects in real life are truly linear. Not many. Very few associations in nature follow a straight line indefinitely. Fortunately, we can easily implement regression splines in R and specify how much non-linearity we want to entertain, either as degrees of freedom or as a number of knots. More of either means more potential non-linearity. I personally think of it as the number of times the relationship might change direction.

One way to implement regression splines in R is via the ns() function in the {splines} package, which comes installed with R. Below we fit a model that allows the effect of Nearest to change direction up to three times by specifying df = 3. We might think of this as sort of like using a degree-3 polynomial to model Nearest (but that’s not what we’re doing). The summary shows three coefficients for Nearest, none of which has a direct interpretation.

library(splines)
m2 <- lm(Species ~ Area + Elevation + ns(Nearest, df = 3) + 
           Scruz + Adjacent, 
         data = gala)
summary(m2)
## 
## Call:
## lm(formula = Species ~ Area + Elevation + ns(Nearest, df = 3) + 
##     Scruz + Adjacent, data = gala)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -79.857 -28.285  -0.775  25.498 163.069 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            13.18825   27.39386   0.481   0.6350    
## Area                   -0.03871    0.02098  -1.845   0.0785 .  
## Elevation               0.35011    0.05086   6.883 6.51e-07 ***
## ns(Nearest, df = 3)1 -168.44016   68.41261  -2.462   0.0221 *  
## ns(Nearest, df = 3)2  -56.97623   57.67426  -0.988   0.3339    
## ns(Nearest, df = 3)3   38.61120   46.22137   0.835   0.4125    
## Scruz                  -0.13735    0.19900  -0.690   0.4973    
## Adjacent               -0.08254    0.01743  -4.734   0.0001 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 55.11 on 22 degrees of freedom
## Multiple R-squared:  0.8247, Adjusted R-squared:  0.7689 
## F-statistic: 14.78 on 7 and 22 DF,  p-value: 5.403e-07
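If you’re wondering where ns() put its knots for df = 3, you can build the spline basis directly and inspect its attributes. With df = 3 and no knots specified, ns() places two interior knots at the terciles of Nearest and boundary knots at its range (a quick inspection only; nothing here is needed for the model):

nb <- ns(gala$Nearest, df = 3)
attr(nb, "knots")           # interior knots (roughly the 33rd and 67th percentiles)
attr(nb, "Boundary.knots")  # boundary knots (the range of Nearest)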

To decide whether we should keep this non-linear effect, we can use the anova() function to compare this more complex model to the original model with only linear effects. The null hypothesis is that the simpler, linear model is adequate, i.e., that the extra spline terms explain no additional variability. It appears there is some evidence that this non-linearity improves the model.

anova(m, m2)
## Analysis of Variance Table
## 
## Model 1: Species ~ Area + Elevation + Nearest + Scruz + Adjacent
## Model 2: Species ~ Area + Elevation + ns(Nearest, df = 3) + Scruz + Adjacent
##   Res.Df   RSS Df Sum of Sq      F  Pr(>F)  
## 1     24 89231                              
## 2     22 66808  2     22424 3.6921 0.04144 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

To entertain non-linear effects in {rms} use the rcs() function, which stands for restricted cubic splines. Instead of degrees of freedom we specify knots. Three degrees of freedom corresponds to four knots.

mr2 <- ols(Species ~ Area + Elevation + rcs(Nearest, 4) + 
               Scruz + Adjacent, 
             data = gala, x = TRUE)
mr2
## Linear Regression Model
## 
## ols(formula = Species ~ Area + Elevation + rcs(Nearest, 4) + 
##     Scruz + Adjacent, data = gala, x = TRUE)
## 
##                  Model Likelihood    Discrimination    
##                        Ratio Test           Indexes    
## Obs       30    LR chi2     51.32    R2       0.819    
## sigma55.9556    d.f.            7    R2 adj   0.762    
## d.f.      22    Pr(> chi2) 0.0000    g      107.069    
## 
## Residuals
## 
##     Min      1Q  Median      3Q     Max 
## -90.884 -29.870  -3.715  28.019 163.497 
## 
## 
##           Coef      S.E.      t     Pr(>|t|)
## Intercept   14.1147   30.2230  0.47 0.6451  
## Area        -0.0368    0.0212 -1.74 0.0964  
## Elevation    0.3444    0.0512  6.73 <0.0001 
## Nearest      3.6081   16.9795  0.21 0.8337  
## Nearest'  -714.5143  891.8922 -0.80 0.4316  
## Nearest'' 1001.4929 1207.4895  0.83 0.4158  
## Scruz       -0.2286    0.1978 -1.16 0.2602  
## Adjacent    -0.0806    0.0175 -4.60 0.0001

Notice we get three coefficients for Nearest that differ in value from our lm() result. A brief explanation of why that is can be found on datamethods.org. This paper provides a deeper explanation.

Calling the {rms} anova() method on the model object returns a test for nonlinearity under the Nearest test, labeled “Nonlinear”. Notice the resulting p-value is slightly higher and exceeds 0.05.

anova(mr2)
##                 Analysis of Variance          Response: Species 
## 
##  Factor     d.f. Partial SS MS         F     P     
##  Area        1     9446.295   9446.295  3.02 0.0964
##  Elevation   1   141872.398 141872.398 45.31 <.0001
##  Nearest     3    20348.931   6782.977  2.17 0.1208
##   Nonlinear  2    20348.651  10174.326  3.25 0.0580
##  Scruz       1     4182.177   4182.177  1.34 0.2602
##  Adjacent    1    66204.433  66204.433 21.14 0.0001
##  REGRESSION  7   312198.651  44599.807 14.24 <.0001
##  ERROR      22    68882.715   3131.033

To replicate the {rms} anova() output using lm(), we need to supply ns() with the same knot locations that rcs() used. The following R code comes from this hbiostat.org page.

w <- rcs(gala$Nearest, 4)
kn <- attr(w, 'parms')
m2 <- lm(Species ~ Area + Elevation + ns(Nearest, knots = kn[2:3], 
                                         Boundary.knots = c(kn[1], kn[4])) 
         + Scruz + Adjacent, data = gala)
anova(m, m2)
## Analysis of Variance Table
## 
## Model 1: Species ~ Area + Elevation + Nearest + Scruz + Adjacent
## Model 2: Species ~ Area + Elevation + ns(Nearest, knots = kn[2:3], Boundary.knots = c(kn[1], 
##     kn[4])) + Scruz + Adjacent
##   Res.Df   RSS Df Sum of Sq      F  Pr(>F)  
## 1     24 89231                              
## 2     22 68883  2     20349 3.2495 0.05801 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
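Although the rcs() and ns() bases lead to different coefficients, with matching knots they describe the same family of curves, so the two fits should produce essentially identical predictions. A quick check, assuming fitted() can extract the stored fitted values from the ols object (if not, predict(mr2, newdata = gala) does the same job):

# should be TRUE, up to numerical precision
all.equal(unname(fitted(m2)), unname(fitted(mr2)))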

Which approach is better: ns() or rcs()? I don’t pretend to know and I’m not sure it really matters. The substance of the results doesn’t differ much. Yes, we saw the p-value move from about 0.04 to about 0.06, crossing the 0.05 line, but we know better than to make hard binary decisions based on a p-value.

Effect plots

Since interpreting non-linear effect coefficients is all but impossible, we turn to visualization. When fitting a non-linear model with lm(), the {effects} and {ggeffects} packages are fantastic for creating effect plots.

m2 <- lm(Species ~ Area + Elevation + ns(Nearest, df = 3) + 
           Scruz + Adjacent, gala)
library(effects)
Effect("Nearest", m2) |> plot()

library(ggeffects)
ggeffect(m2, "Nearest") |> plot()

In both cases we see the effect of Nearest is negative for values 0 – 20, but then seems to increase for values above 20.

To get a similar plot for an {rms} model, we can use the Predict() function (note the capital “P”; this is an {rms} function) and then pipe into plot(). You can also pipe into ggplot() and plotp() to get ggplot2 and plotly plots, respectively.

Predict(mr2, Nearest) |> plot() 

To get all effect plots in one go, we can use the {effects} function allEffects() with the lm() model.

allEffects(m2) |> plot()

To do the same with {rms}, we use the Predict() function with no predictors specified. Notice the y-axis has the same limits for all plots.

Predict(mr2) |> plot()

Diagnostics

A linear model makes several assumptions, two of which are constant variance of residuals and normality of residuals. The plot() method for lm() objects makes this easy to assess graphically.

op <- par(mfrow = c(2,2))
plot(m2)

par(op)

The left two plots assess constant variance using two different types of residuals. The upper right plot assesses normality. The bottom right plot helps identify influential observations. The islands of Isabela and Santa Cruz stand out as either not being well fit by the model or unduly influencing the fit of the model.

We can use the plot.lm() method on the ols() model as well, but only for the first plot (i.e., which = 1).

plot(mr2, which = 1)

To create a QQ plot to assess normality of residuals, we need to extract the residuals using the residuals() function and plot them with qqnorm() and qqline().

r <- residuals(mr2, type="student")
qqnorm(r)
qqline(r)

To check for possible influential observations for an lm() model we can use the base R function influence.measures() and its associated summary() method. See ?influence.measures for a breakdown of what is returned. The columns that begin with “dfb” are the DFBETAS statistics. DFBETAS indicate the effect that deleting each observation has on the estimates of the regression coefficients, which is why there is a DFBETAS value for every coefficient in the model.

summary(influence.measures(m2))
## Potentially influential observations of
##   lm(formula = Species ~ Area + Elevation + ns(Nearest, df = 3) +      Scruz + Adjacent, data = gala) :
## 
##              dfb.1_ dfb.Area dfb.Elvt dfb.n(N,d=3)1 dfb.n(N,d=3)2 dfb.n(N,d=3)3
## Darwin        -0.02   0.07    -0.08     0.05         -0.10         -0.10       
## Fernandina     0.14   0.19    -0.15     0.02          0.00          0.03       
## Genovesa      -0.07  -0.08     0.12     0.13         -0.15         -0.43       
## Isabela        0.05 -18.76_*   4.55_*  -1.69_*       -1.12_*        0.94       
## SanCristobal  -0.09  -0.11     0.18    -0.18          0.22          0.49       
## SantaCruz      0.43  -1.03_*   1.58_*  -0.57         -0.96         -0.38       
## Wolf           0.00   0.00     0.00     0.01          0.00          0.00       
##              dfb.Scrz dfb.Adjc dffit    cov.r    cook.d   hat     
## Darwin         0.48     0.00     0.62     2.36_*   0.05     0.48  
## Fernandina    -0.07    -0.91    -1.51    27.71_*   0.30     0.95_*
## Genovesa       0.14    -0.10    -0.47     3.16_*   0.03     0.57  
## Isabela       -0.54    -0.78   -26.78_*   0.45    50.94_*   0.98_*
## SanCristobal  -0.18    -0.12     0.58     2.56_*   0.04     0.50  
## SantaCruz     -0.17    -1.16_*   2.33_*   0.01     0.36     0.21  
## Wolf           0.03     0.00     0.04     2.21_*   0.00     0.34
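If you just want the DFBETAS themselves, base R’s dfbetas() returns the full matrix, with one row per observation and one column per coefficient. Pulling out the row for Isabela shows the same large values flagged above:

db <- dfbetas(m2)
# DFBETAS for the Isabela observation, one value per coefficient
round(db["Isabela", ], 2)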

The {rms} package offers the which.influence() and show.influence() tandem to identify observations whose removal affects the regression coefficients (via DFBETAS). Below we set cutoff = 0.4. An asterisk is placed next to a variable when any of the coefficients associated with that variable change by more than 0.4 standard errors upon removal of the observation. Below we see that removing the Isabela observation changes four of the coefficients. (Also, this is why we set x = TRUE in our call to ols(): so we could use these functions.) The values displayed are the observed values for each observation.

w <- which.influence(mr2, cutoff = 0.4)
show.influence(w, gala)
##             Count     Area Elevation Nearest  Scruz Adjacent
## Darwin          1     2.33       168    34.1 *290.2     2.85
## Fernandina      1   634.49      1494     4.3   95.3 *4669.32
## Isabela         4 *4669.32     *1707     0.7 * 28.1 * 634.49
## Pinta           3 *  59.56     * 777    29.1  119.6 * 129.49
## SanSalvador     3 * 572.33     * 906     0.2   19.8 *   4.89
## SantaCruz       3 * 903.82     * 864     0.6    0.0 *   0.52
## SantaFe         1    24.08       259   *16.5   16.5     0.52
## SantaMaria      5 * 170.92     * 640   * 2.6   49.2 *   0.10

More information

Got a question? Harrell runs a discussion board for the {rms} package.

Chapter 14 of these course notes by Thomas Love is also helpful.

And here’s a nice blog post called An Introduction to the Harrell-verse: Predictive Modeling using the Hmisc and rms Packages by Nicholas Ollberding.

Eventually you may want to buy his book, Regression Modeling Strategies. It’s not cheap, but it’s not outrageous either. Considering how much information, advice, code, examples, and references it contains, I think it’s a bargain. It’s a text that will provide many months, if not years, of study.